Econometrics: Random Variables, Least Squares Estimation, and Goodness of Fit, Study notes of Introduction to Econometrics

Universiteit Maastricht Introduction to Econometrics

This document delves into the fundamental concepts of econometrics, focusing on random variables, their probability distributions, and their application in statistical modeling. It explores the concept of least squares estimation, a core technique for estimating parameters in linear regression models, and examines the use of projection matrices to understand the relationship between fitted values and residuals. The document also discusses measures of goodness of fit, such as the coefficient of determination, and the importance of correctly specifying the regression model to avoid over-specification and ensure unbiased estimates. It concludes with an overview of the central limit theorem and its application in hypothesis testing.

Typology: Study notes

2023/2024

Uploaded on 01/14/2025

basilio-nino-inferrera 🇳🇱

1 document


Applied Econometrics

13/09

EXAM: 3 parts = T/F | six multiple-choice questions | six questions on a proposed empirical analysis

* The test folder contains a mock exam.
* The mid-term exam is just a simulation of the exam (mid-term on the 26th at 8.30).
* Only in the December exam session will there be 4 different options for the exam:

OPTION 1: the grade is determined only by the final test

OPTION 2: the grade is determined 50% by the essay (due by December 10th) and 50% by the final test

OPTION 3: the grade is determined 2/5 by the midterm and 3/5 by the final exam

OPTION 4: the grade is determined 2/5 by the essay, 1/5 by the midterm and 2/5 by the final test

* The grade of the midterm will become zero if it is lower than the grade of the final exam.

Open question: when do we have to communicate the decision? Can I decide to refuse the grade of the midterm or of the essay?

CHAPTER 1

Chapter one is based on the notion of models. A model is a way to study what happens in the real world in a simpler form: we eliminate some aspects to obtain a simplified version of the real world. A simple example is the consumption function: consumption depends on disposable income, so the model is an equation. This model predicts that if disposable income goes up, consumption goes up. It is a model to explain the consumption of households. We can think of a model as a virtual reality, which matters all the more now that we work with computers.

We formulate a theory and then derive models from it; models may or may not be based on mathematics. Models are what we need when we want to explain some aspects of the real world. The theory is that consumption depends on income; the model is:

$$C = \alpha + \beta Y$$

* C = consumption
* $\alpha$ = constant
* $\beta$ = slope coefficient on income
* Y = disposable income

Theory says that there is a positive relationship between the dependent variable C and the independent variable Y.

Not all models are accepted. For example, in this period the ECB raised the interest rate by 75 basis points; in an interview the professor judged this a good decision, but a colleague argued that it was a bad one => models based on expectations work for the professor but not for the colleague. There is not always a consensus.

We create models to validate phenomena observed in the real world, so we are expected to be able to estimate a model. At the end of the course we have to be able to write 10 pages explaining a financial phenomenon, such as the capital asset pricing model (CAPM).

Models in economics are based on random variables. A model can be defined as a set of data-generating processes (DGPs). A DGP is a unique virtual reality, so the model is a set of virtual realities.

Going back to the consumption equation: if I provide a specific value for each parameter ($\alpha$ and $\beta$), I create a unique virtual reality. A model is a set because I have not specified the values, so it contains infinitely many possibilities. We may also use a model just for description or forecasting.

An autoregressive model,

$$y_t = \alpha + \beta y_{t-1} + \mu$$

is a type of model that does not provide any economic explanation for a specific variable, but it is very useful for forecasting, for example, the CPI (consumer price index). There is no economic causality between the variables, so models do not always provide an economic explanation of the phenomenon analyzed.

What to know: Theory → Model → DGP
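As a rough illustration of using the autoregressive model above purely for forecasting, the sketch below simulates a series and produces a one-step-ahead forecast. The parameter values, the Gaussian disturbance, and the function names `simulate_ar1` and `forecast_next` are assumptions made for this example; they do not come from the lecture.

```python
import random


def simulate_ar1(alpha, beta, n, sigma=1.0, y0=0.0, seed=0):
    """Simulate y_t = alpha + beta * y_{t-1} + mu_t with Gaussian mu_t (an assumption)."""
    rng = random.Random(seed)
    path = [y0]
    for _ in range(n):
        shock = rng.gauss(0.0, sigma)
        path.append(alpha + beta * path[-1] + shock)
    return path


def forecast_next(alpha, beta, last_value):
    """One-step-ahead forecast: the unknown disturbance is replaced by its mean, zero."""
    return alpha + beta * last_value


# Hypothetical parameter values, chosen only for illustration.
series = simulate_ar1(alpha=0.5, beta=0.8, n=100)
next_value = forecast_next(0.5, 0.8, series[-1])
```

The forecast uses only the last observed value of the series itself, which is why such a model can forecast well without offering any economic explanation.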


CHAPTER 2

Regression models form the core of the discipline of econometrics. The standard method for estimating a model is called ordinary least squares (OLS). The most elementary type of regression model is the simple linear regression model, "simple" because there is just one explanatory variable:

$$y_t = \beta_1 + \beta_2 x_t + u_t \qquad (2.01)$$

$\beta_1$ = intercept or constant

$\beta_2$ = slope coefficient

On the right side of this equation there is just one variable ($x_t$). Three of the quantities in the equation carry a subscript. Why? The subscript t indexes the observations of a sample. To estimate an econometric model we need to collect data, so we have a sample whose size is denoted by n, and the subscript t indicates which of the n observations we are considering. The subscript t runs from 1 to n and is always a natural number.

Each observation comprises an observation on a dependent variable, which appears on the left side ($y_t$), and an observation on a single explanatory (independent) variable ($x_t$). The relation therefore links the dependent variable and the explanatory variable, for each observation, in terms of two unknown parameters ($\beta_1$ and $\beta_2$) and an unobserved disturbance term or error term ($u_t$). "Error term" is not the best name, because no error at all is made by introducing $u_t$ into the equation.

Of the five quantities in the equation, two are observed ($y_t$ and $x_t$) and three are not ($\beta_1$, $\beta_2$ and $u_t$). Moreover, three ($y_t$, $x_t$ and $u_t$) are specific to observation t, while $\beta_1$ and $\beta_2$ are common to all n observations, so in this sense they are constant.

Now we can move to the example of the consumption function: $C_t = \beta_1 + \beta_2 I_t + u_t$ (I = disposable income). Suppose that t indicates time, for example a year => yearly data. Then $y_t$ is household consumption measured in year t and $x_t$ is household disposable income measured in the same year. Read this way, the equation represents a consumption function, and it is a model.

$\beta_2$ = marginal propensity to consume

$\beta_1$ = autonomous consumption

The purpose of this model is to explain the observed values of the dependent variable (consumption) in terms of the explanatory variable (disposable income). According to the equation, the value of $y_t$ is given for each t by a linear function of $x_t$ plus the disturbance. The first word to explain is "linear", because the relationship here is linear. Linearity is in terms of y and the coefficients.

The linear function, which in this case is $\beta_1 + \beta_2 x_t$, is called the regression function. The regression function is the deterministic part of the equation: the disturbance term ($u_t$) is always a random variable, while the remaining part is deterministic. Deterministic means that we are able to determine its value precisely.

In all cases, though, it is assumed that $u_t$ is a random variable. Most commonly, it is assumed that, whatever the value of $x_t$, the expectation of the random variable $u_t$ is zero. This assumption usually serves to identify the unknown parameters $\beta_1$ and $\beta_2$, in the sense that, under the assumption, equation (2.01) can be true only for specific values of those parameters.

The random variable called the disturbance term is introduced so that the relationship, for example between consumption and disposable income, is not deterministic. We cannot find an exact deterministic relationship because economic data are random, so we need to introduce some randomness.

Of all the possible values taken by a random variable, we are interested in a subset; such a subset is called an EVENT. We attach a probability to such an event.

Consider a discrete random variable (≠ continuous random variable), meaning that there is a finite or countably infinite set of possible values $x_1, x_2, x_3, \dots$ that the random variable can take. I attach a probability $p(x_i)$ to every value $x_i$. The sum of all these probabilities is equal to 1.

In regression analysis we sometimes meet discrete random variables, but these are generally count data.

Another possibility is that the variable is continuous; the dependent variable in a regression model is normally continuous. With continuous random variables we still need to attach probabilities => for any random variable we attach probabilities to EVENTS, that is, to ranges of possible values taken by the random variable. The probability distribution of a continuous random variable can be represented by a cumulative distribution function (CDF). In general this function is denoted by F and gives the probability that the random variable X takes a value smaller than or equal to a specific x:

$$F(x) = \Pr(X \le x)$$

X = generic random variable

x = a specific value taken by the random variable

=> x is a realization of the random variable X

To define a probability distribution we normally need three rules:

* All probabilities lie between 0 and 1. We have a random variable that can take on a range of possible values; in general we take a subset of those values and attach a probability to the event.
* The null set is assigned probability 0, and the full set of possibilities is assigned probability 1.
* The probability assigned to an event that is the union of two disjoint (mutually exclusive) events is the sum of the probabilities assigned to those disjoint events.

The CDF tends to zero as we move to the left, that is, as $x \to -\infty$: if x is very small, the CDF up to that point must be very small. This follows because the event $(X \le x)$ tends to the null set as $x \to -\infty$, and the null set has probability 0. By similar reasoning, $F(x)$ tends to 1 when $x \to +\infty$, because then the event $(X \le x)$ tends to the entire real line. Further, $F(x)$ must be a weakly increasing function of x.

For a continuous r.v., the CDF assigns probabilities to every interval on the real line. However, if we try to assign a probability to a single point, the result is always just zero. Suppose that X is a scalar random variable with CDF $F(x)$.

PDF = the probability density function is the derivative of the cumulative distribution function (it does not always exist):

$$f(x) = F'(x)$$

$$\int_{-\infty}^{+\infty} f(x)\,dx = \int_{-\infty}^{+\infty} F'(x)\,dx = F(+\infty) - F(-\infty) = 1$$

because $F(+\infty) = 1$ and $F(-\infty) = 0$.

Probabilities can be computed in terms of the density as well as the CDF =>

$$\Pr(a \le X \le b) = F(b) - F(a) = \int_{a}^{b} f(x)\,dx$$

This is the general definition; from it we define some important distributions. The most common example of a continuous distribution is the normal distribution. This is the distribution that generates the famous or infamous "bell curve" sometimes thought to influence students' grade distributions. The fundamental member of the normal family of distributions is the standard normal distribution (standard normal = the expected value is set to zero and the standard deviation to 1). It is a continuous scalar distribution, defined on the entire real line. The density of the standard normal distribution is often denoted $\phi(\cdot)$.

Its explicit expression:

$$\phi(x) = (2\pi)^{-1/2} \exp\left(-\tfrac{1}{2}x^2\right)$$

Unlike $\phi(\cdot)$ (the PDF), the CDF, usually denoted $\Phi(\cdot)$, has no elementary closed-form expression:

$$\Phi(x) = \int_{-\infty}^{x} \phi(y)\,dy$$

What is the probability of drawing a positive number from a standard normal distribution, looking at the figure of the density? 0.5 => we have to consider the area to the right of 0.

The expected value of this distribution is exactly zero (the normal distribution is symmetric => same area on the left-hand side and on the right-hand side).

The area under the curve is equal to 1 => the integral of the density (the red line in the figure) is one.

What is the probability of drawing a number larger than 10? Almost 0.

TO ASK: THE EXAMPLE ON THE PDF GRAPH, HOW TO FIND THE PROBABILITY OF THE POINT +10 AND -
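The statements about the standard normal distribution above can be checked numerically. The following is a minimal sketch using only Python's standard library: the CDF is written with the error function, which is a standard identity rather than something taken from the notes, and the helper names and the interval [-1, 2] are chosen for illustration.

```python
import math


def normal_pdf(x):
    """Standard normal density phi(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Standard normal CDF Phi(x), expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def upper_tail(x):
    """Pr(X > x); erfc avoids computing 1 - Phi(x) when Phi(x) is very close to 1."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def integrate(f, a, b, steps=100_000):
    """Trapezoidal rule for the integral of f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


prob_positive = upper_tail(0.0)              # Pr(X > 0)
prob_above_10 = upper_tail(10.0)             # Pr(X > 10), tiny but positive
area = integrate(normal_pdf, -10.0, 10.0)    # area under the density
interval = integrate(normal_pdf, -1.0, 2.0)  # Pr(-1 <= X <= 2) from the density
```

Comparing `interval` with `normal_cdf(2.0) - normal_cdf(-1.0)` illustrates that the same probability can be obtained from the density or from the CDF.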

The expectation of a random variable is often referred to as its first moment. The so-called higher moments, if they exist, are the expectations of the r.v. raised to a power. Thus the second moment of a random variable X is the expectation of $X^2$, the third moment is the expectation of $X^3$, and so on. In general, the $k^{\text{th}}$ moment of a continuous random variable X is:

$$m_k(X) = \int_{-\infty}^{+\infty} x^k f(x)\,dx$$

=> the $k^{\text{th}}$ moment of X is the integral between $-\infty$ and $+\infty$ of x to the power of k times $f(x)$.

The moments depend on the distribution rather than on a specific random variable. For this reason, we often speak of the moments of the distribution rather than the moments of a specific random variable. These moments are uncentered.

Central moments are obtained by subtracting the expectation from the random variable before computing the expected value:

$$\mu_k = E\big((X - E(X))^k\big) = \int_{-\infty}^{+\infty} (x - \mu)^k f(x)\,dx$$

where $\mu = E(X)$. For a discrete X, the $k^{\text{th}}$ central moment is:

$$\mu_k = E\big((X - E(X))^k\big) = \sum_{i=1}^{m} p(x_i)(x_i - \mu)^k$$

The most important central moment is the second central moment, which is called the variance. It is frequently written as $\mathrm{Var}(X)$, and $\sigma^2$ is a common notation for it. The square makes it immediately clear that it cannot be a negative number.

The positive square root of the variance, $\sigma$, is called the standard deviation of the distribution. Estimates of standard deviations are often referred to as standard errors, especially when the random variable in question is a parameter estimator.

A random number generator is available on every computer; the numbers are generated from a uniform distribution, in general on the interval between 0 and 1. Starting from the uniform distribution, algorithms make it possible to obtain draws from other distributions => they are pseudo-random variables, because they are produced in a deterministic way.

Multivariate distributions => rather than a single scalar random variable, we may have a vector-valued random variable. It can be thought of as several scalar random variables that have a single, joint distribution. For simplicity, we will focus on the case of bivariate random variables, where the vector has two elements. A continuous, bivariate random variable $(X_1, X_2)$ has a distribution function:

$$F(x_1, x_2) = \Pr\big((X_1 \le x_1) \cap (X_2 \le x_2)\big)$$

Thus $F(x_1, x_2)$ is the joint probability that both $X_1 \le x_1$ and $X_2 \le x_2$. For continuous variables, we can differentiate the joint distribution function with respect to $x_1$ and $x_2$, and we get the joint density.

A very important concept is statistical independence. Two variables are statistically independent if their joint distribution can be factorized:

$$F(x_1, x_2) = F(x_1, \infty)\,F(\infty, x_2)$$

Here $F(x_1, \infty)$ is the marginal CDF of $X_1$ and $F(\infty, x_2)$ is the marginal CDF of $X_2$.

The first factor is the joint probability that $X_1 \le x_1$ and $X_2 \le \infty$. Since the second inequality imposes no constraint, this factor is just the probability that $X_1 \le x_1$.

When the factorization holds, the joint CDF decomposes into the two marginal CDFs. This means that the two variables are statistically independent.
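The moment formulas for a discrete random variable given earlier in this section can be tried on a small case. The sketch below uses a fair six-sided die as a hypothetical distribution; the function names are illustrative.

```python
def raw_moment(values, probs, k):
    """k-th uncentered moment: sum of p(x_i) * x_i**k."""
    return sum(p * x**k for x, p in zip(values, probs))


def central_moment(values, probs, k):
    """k-th central moment: sum of p(x_i) * (x_i - mu)**k, with mu = E(X)."""
    mu = raw_moment(values, probs, 1)
    return sum(p * (x - mu) ** k for x, p in zip(values, probs))


# Hypothetical example: a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

mean = raw_moment(values, probs, 1)          # first moment (expectation)
variance = central_moment(values, probs, 2)  # second central moment
std_dev = variance ** 0.5                    # positive square root of the variance
```

The same variance is obtained as the second uncentered moment minus the squared mean, `raw_moment(values, probs, 2) - mean**2`.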

Independence (no relationship at all) is stronger than zero covariance: zero covariance means no linear dependence, but it does not exclude strong nonlinear dependence.
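A small constructed example makes the gap between zero covariance and independence concrete. It assumes X takes the values -1, 0 and 1 with equal probability and sets Y = X², so Y is completely determined by X. The example is illustrative and not taken from the lecture.

```python
from fractions import Fraction

# Joint distribution of (X, Y) with Y = X**2 and X uniform on {-1, 0, 1}.
third = Fraction(1, 3)
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}


def expectation(g):
    """E[g(X, Y)] under the joint distribution."""
    return sum(p * g(x, y) for (x, y), p in joint.items())


mean_x = expectation(lambda x, y: x)
mean_y = expectation(lambda x, y: y)
covariance = expectation(lambda x, y: (x - mean_x) * (y - mean_y))

# Independence would require Pr(X=0, Y=0) == Pr(X=0) * Pr(Y=0).
prob_x0 = sum(p for (x, _), p in joint.items() if x == 0)
prob_y0 = sum(p for (_, y), p in joint.items() if y == 0)
prob_both = joint[(0, 0)]
```

Here the covariance is exactly zero, yet the joint probability of (X = 0, Y = 0) differs from the product of the marginal probabilities, so the variables are not independent.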

Conditional probabilities

Suppose that A and B are any two events. Then the probability of event A conditional on B, or given B, is denoted as $\Pr(A \mid B)$ and is defined implicitly by the equation:

$$\Pr(A \cap B) = \Pr(B)\,\Pr(A \mid B)$$

For this equation to make sense as a definition of $\Pr(A \mid B)$, it is necessary that $\Pr(B) \neq 0$.

The idea is that if we know that event B has been realized, this knowledge can provide information about whether event A has also been realized.

=> the conditional density, or conditional PDF, is defined as

$$f(x_1 \mid x_2) = \frac{f(x_1, x_2)}{f(x_2)}$$

Conditional expectations

What is your expectation of an event, and how does that expectation change if you know that some other event has happened? In this case the unconditional expectation is replaced by the conditional expectation.

The conditional expectation is written $E(X_1 \mid x_2)$ => the expected value of $X_1$ conditional on the event $X_2 = x_2$.

In econometrics we build a linear regression model, which is a model of the conditional expectation of a variable. For example, starting from the formula $y_t = \beta_1 + \beta_2 x_t + u_t$, we can look for the expectation of consumption given disposable income => $E[y_t \mid x_t]$. With regression analysis we are trying to model the conditional expectation of the dependent variable with respect to some explanatory variables.

We have some properties associated with conditional expectations:

* Law of Iterated Expectations: $E(E(X_1 \mid X_2)) = E(X_1)$. We take the expectation of all these conditional expectations and we get the unconditional expectation.
* Another property of the conditional expectation concerns deterministic functions. Any deterministic function of a conditioning variable is its own conditional expectation. In other words: $E(X_2 \mid X_2) = X_2$. This means that if I know that $X_2$ has been realized and then I compute the expectation of $X_2$, the result must be $X_2$. If I have a range of possible values for a random variable but I know that one of those values has been realized => it makes no sense to compute the expectation, as I already know the answer.

Previously we proposed a definition of a DGP as something that can be simulated on a computer, and that constitutes a unique recipe for simulation. This definition is fine for virtual reality, but, despite some claims to the contrary, we do not think that we are living in a simulation!
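The law of iterated expectations stated above can be verified on a small discrete case. The joint probabilities below are hypothetical numbers chosen for illustration; exact fractions are used so that the equality holds without rounding.

```python
from fractions import Fraction

# Hypothetical joint probabilities Pr(X1 = x1, X2 = x2).
joint = {
    (0, 0): Fraction(1, 10), (1, 0): Fraction(3, 10),
    (0, 1): Fraction(4, 10), (1, 1): Fraction(2, 10),
}


def marginal_x2(x2):
    """Pr(X2 = x2), obtained by summing the joint probabilities over x1."""
    return sum(p for (_, b), p in joint.items() if b == x2)


def cond_expectation_x1(x2):
    """E(X1 | X2 = x2) = sum over x1 of x1 * Pr(x1, x2) / Pr(x2)."""
    return sum(a * p for (a, b), p in joint.items() if b == x2) / marginal_x2(x2)


# Unconditional expectation E(X1).
unconditional = sum(a * p for (a, _), p in joint.items())

# E(E(X1 | X2)): average the conditional expectations over the values of X2.
iterated = sum(cond_expectation_x1(x2) * marginal_x2(x2) for x2 in (0, 1))
```

The conditional expectations differ across the two values of X2, but their probability-weighted average reproduces the unconditional expectation.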

For any random variable we have the expectation and the moments. Moments can be centered, giving central moments. To move from uncentered moments to central moments we subtract the expected value from the random variable. The most important central moment is the second one, which is the variance. The positive square root of the variance is denoted by $\sigma$ and is called the standard deviation, while estimates of the standard deviation are called standard errors.

The specification of the regression model

The specification of the regression model is an important step when we want to estimate a model in an empirical analysis. In our example, it means putting disposable income on the right-hand side of the equation. More generally, I have to put all the relevant variables on the right-hand side to explain the dependent variable.

We made the assumption that the disturbance term conditional on x (disposable income) has zero expected value:

$$E(u_t \mid x_t) = 0$$

If the assumption holds, then our model becomes a model of the conditional expectation of the dependent variable, $E(y_t \mid x_t) = \beta_1 + \beta_2 x_t$: we model the conditional expectation of consumption, given a certain value of disposable income.

When we take the conditional expectation in the regression model we have to distinguish between exogenous and endogenous variables. We want to take the conditional expectation with respect to exogenous variables. The difference is:

* If the value of the variable is determined within the model, the variable is said to be endogenous.
* If the value of the variable is determined outside the model, it is called exogenous.

In our consumption example disposable income is determined outside the model, so it can be treated as an exogenous variable. This is not true for consumption, because we are trying to explain it through disposable income.

The conditional set is denoted by

The disturbance term

The disturbance term is a fundamental part of any econometric model: whether or not we can use OLS depends on how it is specified. Thus, everything depends on the assumptions we make about the disturbances. The basic assumption is that they are IID (independent and identically distributed). This rules out serial correlation, which is very common in time-series data.

Another phenomenon that contradicts the IID assumption is heteroskedasticity = the disturbances have different distributions. Here the second "I" (identically distributed) does not apply. When all the disturbances have the same variance, we have homoskedasticity.

Once the distribution of the disturbances has been specified, we can simulate an econometric model, meaning that we create a single virtual reality. We have a DGP (data-generating process) and we select one of the infinitely many virtual realities allowed by the economic model.

Steps to simulate a model:

1. Fix the sample size, n. In Gretl => new data set: cross-sectional = 200; time series = …; panel = n for the cross-sectional dimension and t for the time series.
2. Choose the parameters (here $\beta_1$ and $\beta_2$) of the deterministic specification. In Gretl, add $\beta_1$ = 1.5 and $\beta_2$ = 2.
3. Obtain the n successive values $x_t$, t = 1, ..., n, of the explanatory variable. These values may be real-world data or the output of another simulation. In Gretl we add a distribution with min 10 and max 20, name = disposable income.
4. Evaluate the n successive values of the regression function (the regression function is the deterministic part of the regression model => the right-hand side of the equation without the disturbance term): $\beta_1 + \beta_2 x_t$, for t = 1, …, n. In Gretl define a new variable detf (deterministic function) = 1.5 + 2.5 x (x = disposable income).
5. Choose the probability distribution of the disturbances, if necessary specifying parameters such as its expectation and variance.
6. Use a random-number generator to generate the n successive and mutually independent values $u_t$ of the disturbances.
7. Form the n successive values $y_t$ of the dependent variable by adding the disturbances to the values of the regression function.
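The simulation steps above can be sketched in a few lines of Python instead of Gretl. This is only an illustration under stated assumptions: the explanatory variable is drawn from a uniform distribution on [10, 20], the disturbances are normal with mean 0 and standard deviation 1, and the slope is set to 2 as in the parameter step (the Gretl formula in the notes uses 2.5, so the number should be treated as illustrative). The function name `simulate_dgp` is hypothetical.

```python
import random


def simulate_dgp(n, beta1, beta2, sigma, seed=0):
    """Simulate one sample from y_t = beta1 + beta2 * x_t + u_t."""
    rng = random.Random(seed)
    # Step 3: values of the explanatory variable (assumed uniform on [10, 20]).
    x = [rng.uniform(10.0, 20.0) for _ in range(n)]
    # Step 4: regression function, the deterministic part of the model.
    detf = [beta1 + beta2 * xt for xt in x]
    # Steps 5-6: IID disturbances (assumed normal, mean 0, std dev sigma).
    u = [rng.gauss(0.0, sigma) for _ in range(n)]
    # Step 7: dependent variable = regression function + disturbance.
    y = [d + ut for d, ut in zip(detf, u)]
    return x, detf, u, y


# Steps 1-2: sample size and parameters (values assumed for illustration).
x, detf, u, y = simulate_dgp(n=200, beta1=1.5, beta2=2.0, sigma=1.0)
```

Each call with a different seed produces a different sample from the same DGP, that is, a different draw from the same virtual reality.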

Linear and Nonlinear Regression Models

N.B. a model that becomes linear after taking logs is said to be a loglinear regression model.

The relationship between x and y in such models is not linear. Despite that, these models are considered to be linear, because in econometrics linearity refers to the relationship between the dependent variable and the coefficients of the explanatory variables => all three of these models are linear regression models.

Matrix algebra

If a matrix has the same number of columns and rows, it is said to be square. A square matrix A is symmetric if $A_{ij} = A_{ji}$ for all i and j. Symmetric matrices occur very frequently in econometrics. A square matrix is said to be diagonal if $A_{ij} = 0$ for all $i \neq j$; in this case, the only nonzero entries are those on what is called the principal diagonal.

Sometimes a square matrix has all zeros above or below the principal diagonal. Such a matrix is said to be triangular. If the nonzero elements are all above the diagonal, it is said to be upper-triangular; if the nonzero elements are all below the diagonal, it is said to be lower-triangular.

The transpose of a matrix is obtained by interchanging its row and column subscripts. A matrix A is symmetric if and only if $A = A^\top$.

Arithmetic Operations on Matrices

* Addition and subtraction: the matrices need to have the same dimensions.
* Multiplication: matrix multiplication actually involves both additions and multiplications. It is based on what is called the inner product, or scalar product, or sometimes dot product of two vectors.

Suppose that a and b are n-vectors. Then their inner product is:

$$a^\top b = \sum_{i=1}^{n} a_i b_i$$

When two matrices are multiplied together, the $ij^{\text{th}}$ element of the result is equal to the inner product of the $i^{\text{th}}$ row of the first matrix with the $j^{\text{th}}$ column of the second matrix. Thus, if C = AB:

$$C_{ij} = \sum_{k=1}^{m} A_{ik} B_{kj}$$

For this to make sense, we must assume that A has m columns and that B has m rows.

Let y denote an n-vector with typical element $y_t$, u an n-vector with typical element $u_t$, X an n × 2 matrix that consists of a column of 1s and a column with typical element $x_t$, and $\beta$ a 2-vector with typical element $\beta_i$, i = 1, 2. Thus we have:

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}, \quad u = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}$$

The column of 1s is the regressor whose coefficient is $\beta_1$.

This entire system can be written very compactly as

$$y = X\beta + u$$

When a regression model is written in this form, the separate columns of the matrix X are called regressors, and the column vector y is called the regressand.

For this model the products $X^\top X$ and $X^\top y$ can be written as

$$X^\top X = \begin{bmatrix} n & \sum_t x_t \\ \sum_t x_t & \sum_t x_t^2 \end{bmatrix}, \qquad X^\top y = \begin{bmatrix} \sum_t y_t \\ \sum_t x_t y_t \end{bmatrix}$$

As we will see in the next section, the least-squares estimates of $\beta$ depend only on the matrix $X^\top X$ and the vector $X^\top y$.

Partitioned Matrices

There are many ways of writing an n × k matrix X that are intermediate between the straightforward notation X and the full element-by-element decomposition of X.

We might wish to separate the columns while grouping the rows:

$$X = \begin{bmatrix} x_1 & x_2 & \cdots & x_k \end{bmatrix}$$

or we might wish to separate the rows but not the columns:

$$X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}$$

To save space, the second form can also be written on a single line. There is no restriction on how a matrix can be partitioned, so long as all the sub-matrices or blocks fit together correctly.

Our aim is to estimate $\beta$, the vector of unknown coefficients; we do this by estimating the model.
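To make the matrix operations above concrete, the sketch below builds the inner product and the matrix product from scratch and then forms $X^\top X$ and $X^\top y$ for a small data set. The last step solves $(X^\top X)\,b = X^\top y$, the standard least-squares solution, which these notes only announce at this point, so it is included as an assumption. The data are hypothetical and lie exactly on the line y = 1 + 2x.

```python
def inner(a, b):
    """Inner (dot) product of two vectors of equal length."""
    return sum(ai * bi for ai, bi in zip(a, b))


def transpose(A):
    """Interchange rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """C = AB: element ij is the inner product of row i of A with column j of B."""
    Bt = transpose(B)
    return [[inner(row, col) for col in Bt] for row in A]


# Hypothetical data lying exactly on y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [[3.0], [5.0], [7.0], [9.0]]  # n x 1 column vector
X = [[1.0, xt] for xt in x]       # column of 1s and column of x_t

XtX = matmul(transpose(X), X)     # [[n, sum x], [sum x, sum x^2]]
Xty = matmul(transpose(X), y)     # [[sum y], [sum x*y]]

# Solve the 2x2 system (X'X) b = X'y with Cramer's rule.
(a11, a12), (a21, a22) = XtX
det = a11 * a22 - a12 * a21
beta1 = (a22 * Xty[0][0] - a12 * Xty[1][0]) / det
beta2 = (a11 * Xty[1][0] - a21 * Xty[0][0]) / det
```

Because the data contain no disturbance, the solution recovers the intercept and slope of the line exactly; only `XtX` and `Xty` are needed to compute it.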

Estimation techniques

Population mean = the mean of an entire population. DGP is the econometric way to define the

population mean.

Sample mean = mean of a sample selected from a population

The first way to estimate a model is called the method of moments. Imagine for the moment that β2 is equal to zero in the DGP => no explanatory variable, just the constant. Now the model becomes

$y_t = \beta + u_t$

Consider for the moment the case where $u_t$ is IID with zero expected value. In this case $E[y_t]$ is the unconditional expectation, because we have no x. This means that the expectation of $y_t$ is simply β: $E[y_t] = \beta + E[u_t] = \beta$.

The method of moments applies directly, and we define the estimator of β by the sample mean of the $y_t$:

$\hat\beta = \frac{1}{n}\sum_{t=1}^{n} y_t$

As the name (method of moments) implies, it can be used with moments other than the

expectation. In general, the method of moments estimates population moments by the

corresponding sample moments. However, we cannot apply the method of moments directly to

regression models, because, except in one trivial case that we discuss first, the parameters we

wish to estimate are not population means.
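The idea of replacing population moments by sample moments can be sketched with simulated data. The DGP below (normal disturbances, β = 2, standard deviation 1) is an assumption made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed so the sketch is reproducible

beta_true = 2.0                     # hypothetical value of beta in the DGP
sigma_true = 1.0                    # hypothetical std. deviation of u_t
n = 10_000

u = rng.normal(0.0, sigma_true, size=n)   # IID disturbances with E[u_t] = 0
y = beta_true + u                          # DGP: y_t = beta + u_t

# First moment: the population mean E[y_t] = beta is estimated by the sample mean.
beta_hat = y.mean()

# Second central moment: Var(y_t) is estimated by the corresponding sample moment.
var_hat = ((y - beta_hat) ** 2).mean()
```

With a large sample, both sample moments should land close to the values used in the simulated DGP, though never exactly on them.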

This means that in our DGP we have the model $y_t = \beta + u_t$, and the expectation of $u_t$ at the DGP level is equal to zero. We can use this result to impose the same condition in our sample.

DGP: at the DGP level, $E[u_t] = 0$.

From the DGP we can derive an infinite number of samples, but we have to work with just one of them. If the expectation of u is equal to zero, the individual $u_t$ within our sample are not 0, but on average they must be equal to 0. This is equivalent to saying

$\frac{1}{n}\sum_{t=1}^{n} u_t = 0$, or $\bar{u} = 0$.

β is the true value, that is, the value at the DGP level. We write $\hat\beta$ (with a hat) to denote our estimated coefficient => it is an estimated value.

The rule used to derive the estimate is called the estimator.

In a regression model with just the constant and no explanatory variable, the estimated β ($\hat\beta$) is simply the sample average of the dependent variable. $\hat\beta$: estimate of the β used in the DGP.

Every time we compute a sample average, we are in effect running a regression with only a constant as a regressor*.

Model DGP: $y_t = \beta + u_t$, assuming $E[u_t] = 0$.

From the DGP we can derive a sample and observe $y_1$ to $y_n$. Now we want to estimate β: $\hat\beta = ?$

For each observation we have $u_1 = y_1 - \beta$, $u_2 = y_2 - \beta$, $u_3 = y_3 - \beta$, ..., $u_n = y_n - \beta$. The $u_t$ are not observed, so we need at least one constraint in order to derive $\hat\beta$ correctly. If we assume that the first moment of the distribution of the disturbance term is 0, we may impose the same condition at the sample level. We are translating an assumption on the distribution of the disturbance term into an equivalent assumption for the sample.

The constraint becomes

$\frac{1}{n}\sum_{t=1}^{n} u_t = \bar{u} = 0$ (the sample average of the $u_t$ is equal to 0).

Substituting $u_t = y_t - \hat\beta$:

$\sum_{t=1}^{n} (y_t - \hat\beta) = 0 \;\Rightarrow\; \sum_{t=1}^{n} y_t - n\hat\beta = 0 \;\Rightarrow\; \hat\beta = \frac{1}{n}\sum_{t=1}^{n} y_t = \bar{y}$ (the sample average of the y variable).

*See 2.37 book
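The claim that a sample average is the same thing as a regression on a constant can be checked numerically. The sketch below uses a made-up sample and NumPy's least-squares routine; it is an illustration, not part of the original notes.

```python
import numpy as np

y = np.array([3.0, 5.0, 4.0, 8.0])        # hypothetical sample
n = y.size
iota = np.ones((n, 1))                     # the only regressor: a column of 1s

# Least-squares regression of y on the constant alone.
beta_hat, *_ = np.linalg.lstsq(iota, y, rcond=None)

# Implied u_t = y_t - beta_hat; their sample average is zero by construction.
u_hat = y - beta_hat[0]
```

Here `beta_hat[0]` coincides with `y.mean()`, and the average of `u_hat` is zero up to rounding, which is exactly the sample constraint imposed in the derivation.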

For the multiple linear regression model, the expression $y_t - X_t\beta$ is equal to the disturbance for the t-th observation, but only if the correct value of the parameter vector β is used (the true β). If the same expression is thought of as a function of β, with β allowed to vary arbitrarily, then above we called it the residual associated with the t-th observation. Similarly, the n-vector y − Xβ is called the residual vector.

Therefore, $u_t = y_t - X_t\beta$ evaluated at the true β is the disturbance term, and we do not know it, because we do not know the value of the true β. If we consider a range of possible values for β, we call the same quantity the residual ($\hat{u}_t$) for the t-th observation.

The sum of the squares of the components of that vector is called the sum of squared residuals, or SSR. Since this sum is a scalar, the sum of squared residuals is a scalar-valued function of the k-vector β:

$SSR(\beta) = \sum_{t=1}^{n} (y_t - X_t\beta)^2$

The notation here emphasizes the fact that this function can be computed for arbitrary values of

the argument β purely in terms of the observed data y and X.
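To underline that the SSR can be evaluated at any candidate β using only the observed y and X, here is a minimal sketch. The function name `ssr` and the data are assumptions for illustration.

```python
import numpy as np

def ssr(beta, y, X):
    """Sum of squared residuals SSR(beta) = sum_t (y_t - X_t beta)^2."""
    resid = y - X @ beta
    return float(resid @ resid)

# Hypothetical data: a constant and one explanatory variable.
X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
y = np.array([2.0, 3.0, 5.0, 6.0])

ssr_a = ssr(np.array([0.0, 1.0]), y, X)   # candidate beta = (0, 1)
ssr_b = ssr(np.array([0.5, 1.4]), y, X)   # candidate beta = (0.5, 1.4)
```

Different candidate values of β give different values of the SSR; for these data the second candidate fits much better than the first.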

With least-squares estimation we want to minimize the distance between $y_t$ and $X_t\beta$. The idea of least-squares estimation is therefore to minimize the sum of squared residuals associated with a regression model.

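Consider briefly the simplest case of $y_t = \beta_1 + \beta_2 x_t + u_t$, in which β2 = 0 and the model contains only a constant term. The expression above becomes:

$SSR(\beta_1) = \sum_{t=1}^{n} (y_t - \beta_1)^2 = \sum_{t=1}^{n} y_t^2 + n\beta_1^2 - 2\beta_1 \sum_{t=1}^{n} y_t$

Differentiating with respect to $\beta_1$ and setting the derivative to zero gives the first-order condition $2n\beta_1 - 2\sum_{t=1}^{n} y_t = 0$.

For this model, the matrix X consists solely of the constant vector, ι. Therefore, $X^\top X = \iota^\top \iota = n$, and $X^\top y = \iota^\top y = \sum_{t=1}^{n} y_t$. Thus, if the first-order condition is multiplied by one-half, it can be rewritten as $\iota^\top \iota \, \beta_1 = \iota^\top y$. Solving for $\beta_1$ yields the sample mean of the $y_t$:

$\hat\beta_1 = \frac{1}{n}\sum_{t=1}^{n} y_t$

Not surprisingly, the OLS estimator is equivalent to the estimating-function estimator (2.45) for the multiple linear regression model as well.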
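The same first-order condition can be solved numerically. The sketch below assumes the standard normal-equations form $X^\top X \beta = X^\top y$ for the general case (the notes only spell out the constant-only version) and uses made-up data.

```python
import numpy as np

# Hypothetical data: a constant and one explanatory variable.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 6.0])
X = np.column_stack([np.ones(x.size), x])

# General case: solve the normal equations X'X beta = X'y for beta.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Constant-only case: X = iota, so iota'iota * beta_1 = iota'y gives the sample mean.
iota = np.ones((x.size, 1))
beta1_hat = np.linalg.solve(iota.T @ iota, iota.T @ y)[0]
```

Using `np.linalg.solve` rather than an explicit matrix inverse is a common numerical choice; for a well-conditioned problem like this one both give the same estimates.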

Gretl

Residuals ($\hat{u}_t = y_t - X_t\hat\beta$) are the estimated disturbances. The disturbances are unobservable; we can only estimate them. If we compute the variance of the residuals and take the square root of that variance, we get the standard error of the regression.

The reported sum of squared residuals is the SSR evaluated at the $\hat\beta$ that minimizes the distance:

$SSR(\hat\beta) = \sum_{t=1}^{n} (y_t - X_t\hat\beta)^2 = \sum_{t=1}^{n} \hat{u}_t^2$
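The two reported quantities can be reproduced by hand. The sketch below uses made-up data and assumes the usual degrees-of-freedom correction, dividing the SSR by n − k rather than n when computing the residual variance; that divisor is an assumption here, not something stated in the notes.

```python
import numpy as np

# Hypothetical data: a constant and one explanatory variable.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 6.0])
X = np.column_stack([np.ones(x.size), x])
n, k = X.shape

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimates
u_hat = y - X @ beta_hat                        # residuals
ssr_hat = float(u_hat @ u_hat)                  # sum of squared residuals

# Assumption: variance of the residuals with divisor n - k.
s2 = ssr_hat / (n - k)
ser = float(np.sqrt(s2))                        # standard error of the regression
```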

CHAPTER 3

The geometry of OLS estimation

Imagine a regression with two explanatory variables, x1 and x2. We can plot these two vectors starting from a common origin: x1 is a constant, while x2 contains the observations on the explanatory variable. These two vectors span a plane. We can also plot another vector containing the observations on the dependent variable: the vector y. When we regress y on x1 and x2, we project y onto this plane. This means that we draw a perpendicular line (at a 90-degree angle) from y to the plane defined by the two regressors. Thus, OLS projects y orthogonally onto the space spanned by the regressors.

=> y, the fitted vector $X\hat\beta$ and the residual vector $\hat{u}$ form a right-angled triangle, so we can apply Pythagoras' theorem: $\|y\|^2 = \|X\hat\beta\|^2 + \|\hat{u}\|^2$
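The right-angled triangle formed by y, the fitted values and the residuals can be verified numerically. The simulated data below (coefficients 1 and 2, normal noise) are an assumption made only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)       # fixed seed for reproducibility
n = 50

x2 = rng.normal(size=n)              # hypothetical explanatory variable
X = np.column_stack([np.ones(n), x2])
y = 1.0 + 2.0 * x2 + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_fit = X @ beta_hat                 # projection of y onto the span of x1 and x2
u_hat = y - y_fit                    # residual vector, perpendicular to that span

lhs = float(y @ y)                               # ||y||^2
rhs = float(y_fit @ y_fit + u_hat @ u_hat)       # ||X beta_hat||^2 + ||u_hat||^2
```

Up to rounding error, `lhs` and `rhs` agree, and the inner product of `y_fit` and `u_hat` is zero.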

The dependent variable is the sale price. The result allows us to understand how the sale price changes in response to a small change in the number of baths or bedrooms.

To sum the two vectors, we start from point A (the tip of x), move horizontally by y1 and then vertically by y2, and reach the point C. We can do the same starting from B (the tip of y), moving by x1 and x2, and we again reach C. Then we connect the origin with the point C: this is the vector that results from summing x and y. There is another way to sum vectors: slide each vector parallel to itself so that together they form a parallelogram; the diagonal of the resulting parallelogram is x + y.

The sum of the two vectors is then the diagonal through O of the resulting parallelogram. It is easy

to see that this classical method also gives the result that the sum of the two vectors is

represented by the arrow OC, since the figure OACB is just the parallelogram required by the

construction, and OC is its diagonal through O. The parallelogram construction also shows clearly

that vector addition is commutative, since y + x is represented by OB, for y, followed by BC, for x.

The end result is once more OC.

Multiplying a vector by a scalar is also very easy to represent

geometrically. If a vector x with components (x 1 ,x 2 ) is multiplied by a

scalar α, then αx has components (αx 1 , αx 2 ). This is depicted in

Figure 3.5, where α = 2. The line segments OA and OB represent x

and αx, respectively. It is clear that even if we move αx so that it

starts somewhere other than O, as with CD in the figure, the vectors x

and αx are always parallel. If α were negative, then αx would simply

point in the opposite direction. Thus, for α = −2, αx would be

represented by DC, rather than CD.

Another property of multiplication by a scalar is:

$\|\alpha x\| = \langle \alpha x, \alpha x \rangle^{1/2} = |\alpha| \, (x^\top x)^{1/2} = |\alpha| \, \|x\|$

Now consider the inner product of two vectors, ⟨x, y⟩. The result is given by:

⟨x, y⟩ = ∥x∥ ∥y∥ cos θ

where θ is the angle between the two vectors.
If the vectors are parallel, the angle between them is 0 and thus cos θ = 1, so ⟨x, y⟩ = ∥x∥ ∥y∥.
If the vectors are perpendicular, they form a right angle and cos(π/2) = 0, so ⟨x, y⟩ = ∥x∥ ∥y∥ cos(π/2) = 0.

When two vectors are orthogonal (perpendicular), their inner product is zero.
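The relationship between the inner product and the angle can be sketched by solving for cos θ. The helper name `cos_angle` and the example vectors are made up for illustration.

```python
import numpy as np

def cos_angle(x, y):
    """Cosine of the angle between x and y: <x, y> / (||x|| ||y||)."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1.0, 0.0])
parallel = np.array([3.0, 0.0])        # same direction as x
perpendicular = np.array([0.0, 2.0])   # at a right angle to x
diagonal = np.array([1.0, 1.0])        # at 45 degrees to x

cos_parallel = cos_angle(x, parallel)
cos_perpendicular = cos_angle(x, perpendicular)
cos_diagonal = cos_angle(x, diagonal)
```

The three cosines come out as 1, 0 and 1/√2, matching the parallel, perpendicular and 45-degree cases.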

Imagine a regression model with x1 and x2, vectors containing the observations on the first and on the second regressor. They span a plane. Then we have a vector y containing the observations on the dependent variable. When we regress y on x1 and x2, we project y onto the space spanned by x1 and x2, that is, we drop a perpendicular from y onto the plane. The residual vector $\hat{u}$ forms a right angle with the plane created by x1 and x2. So, if we take the inner product of $\hat{u}$ with x1 or with x2, the result is always 0. This is a numerical property of OLS.
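This numerical property is easy to check on a small example. The data below are made up, and the least-squares fit is computed with NumPy's `lstsq`.

```python
import numpy as np

# Hypothetical data: x1 is the constant, x2 the explanatory variable.
x1 = np.ones(4)
x2 = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 6.0])
X = np.column_stack([x1, x2])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
u_hat = y - X @ beta_hat            # OLS residuals

ip_x1 = float(x1 @ u_hat)           # inner product of residuals with x1
ip_x2 = float(x2 @ u_hat)           # inner product of residuals with x2
```

Both inner products are zero up to rounding. Because x1 is a column of 1s, the first one also says that the residuals sum to zero.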

Subspaces of Euclidean space: we can identify two different subspaces. The first is the one onto which we project the dependent variable, and the second is the one used to represent the residuals of the regression.

To see how this works, we need the concept of a subspace of a Euclidean space $E^n$. Normally, such a subspace has a dimension lower than n. The easiest way to define a subspace of $E^n$ is in terms of a set of basis vectors. A subspace that is of particular interest to us is the one for which the columns of X provide the basis vectors. We may denote the k columns of X as $x_1, x_2, \ldots, x_k$. Then the subspace associated with these k basis vectors is denoted by S(X) or $S(x_1, \ldots, x_k)$. The basis vectors are said to span this subspace, which in general is a k-dimensional subspace.

[Figure: the vectors $x_2$, $\beta_1 x_1$ and $\beta_2 x_2$ in the subspace spanned by X]

When we define a matrix X, we are defining the space onto which we project the y variable. Adding regressors means adding dimensions to the space onto which y is projected.

y = β1 + u => y = βx + u, with x = ι (the constant vector)

In this case we have a space like this: [Figure: y and the line spanned by ι]

In more complex cases, we have: [Figure: y and the plane spanned by x1 and x2]

Thus, applying OLS we will have two subspaces:

The first, of dimension k, is the space onto which we project y.
The second, of the remaining dimension n − k, is the space where the residuals live.
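One way to see the two dimensions numerically is through projection matrices. The sketch below assumes the standard constructions $P_X = X(X^\top X)^{-1}X^\top$ and $M_X = I - P_X$, which are not written out in the passage above, and uses the fact that the trace of a projection matrix equals the dimension of the space it projects onto. The data are simulated.

```python
import numpy as np

rng = np.random.default_rng(2)       # fixed seed for reproducibility
n, k = 6, 2

X = np.column_stack([np.ones(n), rng.normal(size=n)])   # hypothetical regressors
y = rng.normal(size=n)                                   # hypothetical regressand

# P projects onto S(X); M projects onto its orthogonal complement.
P = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - P

dim_fit = int(round(np.trace(P)))     # dimension of the space where X beta_hat lives
dim_resid = int(round(np.trace(M)))   # dimension of the space where residuals live
```

Here `P @ y` gives the fitted values and `M @ y` the residuals, and together they add back up to y.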

The subspace defined here is called the subspace spanned by the $x_i$, i = 1, ..., k, or the column space of X; less formally, it may simply be referred to as the span of X, or the span of the $x_i$. The orthogonal complement of S(X) in $E^n$, which is denoted $S^\perp(X)$, is the set of all vectors w in $E^n$ that are orthogonal to everything in S(X). This means that, for every z in S(X), $\langle w, z \rangle = w^\top z = 0$. Formally:

$S^\perp(X) \equiv \{\, w \in E^n \mid w^\top z = 0 \text{ for all } z \in S(X) \,\}$
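A basis for the orthogonal complement can be computed explicitly. The sketch below uses a complete QR decomposition, which is one convenient numerical route and not something the notes prescribe; it assumes X has full column rank, and the data are made up.

```python
import numpy as np

# Hypothetical X with n = 4 rows and k = 2 linearly independent columns.
X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
n, k = X.shape

# Complete QR: Q is an n x n orthogonal matrix. With X of full column rank,
# the first k columns of Q span S(X) and the remaining n - k span S_perp(X).
Q, R = np.linalg.qr(X, mode="complete")
W = Q[:, k:]                          # basis vectors of the orthogonal complement

z = X @ np.array([0.5, 1.4])          # an arbitrary element of S(X)
check = W.T @ z                       # inner products w'z for each basis vector w
```

Every entry of `check` is zero up to rounding, and `W` has n − k columns, matching the dimension of the space where the residuals live.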

Applying OLS with a single regressor, we have just 1 dimension => 1 line.