












































Lecture notes, 2023, Prof. Giupponi (first half)
Type: Notes













































Lecture 1 – Introduction

Policy evaluation means evaluating the consequences of given policy interventions.

Objectives of the course
Introduce the main econometric techniques used in data analysis for the identification of causal relationships. The methods covered will enable us to:
Example: relationship between education and labour market outcomes
Data for the US in 2021. There is a strong positive gradient: better-educated people tend to earn higher wages. There is also a negative gradient: higher levels of education are correlated with lower unemployment rates. We see a strong association between schooling and earnings, but does that mean schooling increases earnings? There is a correlation; is there causality?

Possible explanations of the relationship between education and earnings:
The variance of the sample mean is \( \mathrm{Var}(\bar{Y}) = \sigma^2 / n \). The more dispersed the population, the more dispersed the distribution of the sample mean; the more observations I draw in my sample, the lower the variance of the sample mean. With larger samples, I would on average get closer to the true population mean.

We don't observe sigma or mu because they are population parameters (characteristics). We need an estimator for sigma to find the sampling standard deviation. The sample counterpart of this formula is the standard error of the sample mean, \( SE(\bar{Y}) = s / \sqrt{n} \), where s is the sample standard deviation: something we can actually calculate with the data.

Back to assessing differences in means
How will we approach the question of whether a difference is real or due to sampling variation? We will construct a test statistic and then form a belief (= define a hypothesis) about how the world really works based upon it. This often means asking: "How surprised should I be by what I observe if a particular model of the world were true?"

For example, consider the case in which we want to understand whether students allocated to smaller or larger classes achieve better results. We start from a given hypothesis, "there is no difference in test scores". We then go to the data, observe the sample, use our estimator, and construct the mean for those in large classes and the mean for those in small classes to compare the two. The question we ask ourselves is: how surprising would it be to observe that test scores for students randomly allocated to smaller classes were five points higher if in fact class size does not matter?

Data on people who have free health insurance (= can consume as many health services as they want) and those who have minimal health coverage (= covers only catastrophic events). The average health expenses of those under the minimal plan are 636; the difference with the others is 285. Below, in brackets, we have the standard error of that difference in means.
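The standard error of the sample mean described above can be computed directly from a sample. The snippet below is a minimal sketch, not part of the lecture: the function name `standard_error` and the test scores are invented for illustration.

```python
import math
import statistics


def standard_error(sample):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    n = len(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return s / math.sqrt(n)


# Hypothetical test scores, invented for illustration only.
scores = [640.0, 655.0, 662.0, 648.0, 670.0, 659.0, 651.0, 665.0]
print(statistics.mean(scores), standard_error(scores))
```

For a given sample standard deviation, a larger n gives a smaller standard error, which is the sample analogue of the variance formula above.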
Is this 285 real (= is it statistically significant) or just the outcome of sampling variation? Statisticians have developed formal methods to quantify how likely or unlikely a difference is. The starting point is to compare our estimate to some fixed benchmark, a procedure called hypothesis testing.

Classical hypothesis testing begins with formulating a null hypothesis: for example, providing free care does not affect total health expenditures (we typically test the null hypothesis that the treatment had no effect). Then we test the null hypothesis against an alternative one, taking a falsification approach: the null hypothesis is maintained unless the data provide strong evidence against it. We reject the null if we have evidence beyond a reasonable doubt against it.

Let's test our null hypothesis
Is it plausible that we would observe a difference of 285 if indeed free care did not affect usage? The first building block is defining the null; the second is the t-statistic, a standardized difference between the observed difference in averages and the hypothesized one: \( t = (285 - \mu_0) / 72 = (285 - 0) / 72 \approx 3.96 \), where 285 is the difference in averages, 72 is its standard error and \( \mu_0 = 0 \).
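As a quick check of the arithmetic, the t-statistic for the health expenditure example can be computed as below. This is only a sketch; the function name `t_statistic` is an assumption, while the numbers (285, 72, and a hypothesized value of 0) are those given in the notes.

```python
def t_statistic(estimate, hypothesized, standard_error):
    """Standardized distance between the estimate and the hypothesized value."""
    return (estimate - hypothesized) / standard_error


# Difference in average expenses (285), null value (0), standard error (72).
t = t_statistic(285.0, 0.0, 72.0)
print(round(t, 2))  # 3.96
```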
We need another building block of hypothesis testing: the central limit theorem. By the central limit theorem, if the data come from a distribution with a difference in means of mu0, then in large samples the t-statistic has approximately a standard normal distribution. We use this result to perform hypothesis testing, compute p-values for the hypothesis, and construct confidence intervals.

Classical hypothesis testing
Starting from a null hypothesis, we have to decide whether to reject it or not. First, we compute the t-statistic. Is that t large enough or not? It depends on the research: as researchers, we fix the critical values (= the benchmark), so it is up to us to decide whether t is "large" enough to reject the null hypothesis.

Suppose we adopt the rule that we reject the null when the absolute value of the t-statistic is greater than 1.96. This number means we reject the null at the 5% significance level, and it comes from the central limit theorem: if the null is true, we observe a t-statistic in the tails only 5% of the time. The areas under the tails are called the "rejection region", while the area in the middle is the "acceptance region". The number used as benchmark is called the "critical value".

The intuition is that with a 5% significance level we are willing to take the risk of committing a type 1 error (= rejecting the null when it is actually true) 5% of the time. We can be more risk averse and choose a 1% significance level: the critical value increases to 2.58, which shrinks the rejection region and enlarges the acceptance region. Picking a significance level means choosing how often we are willing to reject the null hypothesis when it is actually true. The standard in economics is 5% (sometimes 10% or 1%). Setting the decision rule means picking a significance level, or size.

In our example, 3.96 is greater than 1.96, so we reject the null at the 5% significance level; it is also greater than 2.58, so we reject it at the 1% level as well.

P-values
They are usually computed from the t-statistic, and Stata reports them.
The p-value is the probability, under the null hypothesis, of observing an estimate at least as adverse to the null as the one actually observed in the sample. The smaller the p-value, the lower the likelihood of observing the actual estimate under the null hypothesis, so a lower p-value means more reason to reject the null. P-values are a more informative reparameterization of the test results than a simple reject/don't reject.

The p-value measures the area under the tails for a particular value of t: if t is 1.96 the area is 0.05, so the p-value is 5%; if t is 2.58 the p-value is 1%. For example, if the p-value is 0.03, we would reject the null at the 5% level but not at the 1% level. The p-value for our t-statistic of 3.96 is 0.000075 (very small).
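The critical values and the p-value quoted in these notes can be reproduced from the standard normal distribution. The sketch below uses Python's standard library rather than Stata; the function names are assumptions made for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1


def critical_value(significance):
    """Two-sided critical value for a given significance level."""
    return std_normal.inv_cdf(1 - significance / 2)


def two_sided_p_value(t):
    """Area under both tails beyond |t| of the standard normal."""
    return 2 * (1 - std_normal.cdf(abs(t)))


print(critical_value(0.05))     # about 1.96
print(critical_value(0.01))     # about 2.58
print(two_sided_p_value(3.96))  # about 0.000075
```

A t-statistic of exactly 1.96 gives a p-value of about 0.05, which is why the decision rule "reject if |t| > 1.96" and the rule "reject if p < 0.05" agree.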
According to the p-value too, we reject the null. If the p-value is greater than 5%, we cannot reject the null; if it is lower, we reject the null. We don't use standard deviations but actual values.

Economic and statistical significance are not the same! The 95% confidence interval for the difference in health expenses between the groups with free and catastrophic coverage is [$144, $426]. We easily (p-value < 0.01) reject the hypothesis that expenditures are the same in the two groups. But is the mean difference of $285 large or small? Economic significance has to do with how big a point estimate is.

Summary
We use the sampling distribution to construct standard errors and, with them, formal t-tests, p-values and confidence intervals. Standard errors and confidence intervals tell us about the precision of our estimates. Hypothesis tests and p-values tell us whether our data are consistent with a priori specified values of our parameters of interest.
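The 95% confidence interval quoted in this lecture follows from the point estimate and its standard error. A minimal sketch, assuming the normal approximation used throughout these notes (the function name `confidence_interval` is hypothetical):

```python
from statistics import NormalDist


def confidence_interval(estimate, standard_error, level=0.95):
    """Normal-approximation confidence interval: estimate +/- z * SE."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * standard_error
    return estimate - half_width, estimate + half_width


# Difference in health expenses (285) and its standard error (72).
low, high = confidence_interval(285.0, 72.0)
print(round(low), round(high))  # 144 426
```

The interval excludes 0, which matches the rejection of the null at the 5% level.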
Lecture 3 – The linear regression model

Suppose we are interested in the relationship between school resources and student achievement in California school districts. Increasing school resources means increasing the number of teachers so as to decrease the student-teacher ratio (= number of students / number of teachers). Consider a school authority considering a change in its class size policy: it is considering hiring additional teachers to reduce class sizes. To evaluate this policy, it would like to know whether doing so will improve student performance.

Sample: California school districts (n = 420) for 1999
Variables:
District mean 5th grade test scores
Student-teacher ratio = number of students in the district divided by the number of full-time equivalent teachers, a measure of how many resources are invested in the district (more teachers = more resources)

We start by thinking about regression as a tool to describe data and explore the mechanical features of regression. We will then come back to the causal question: "do smaller classes result in better outcomes for students?"

Whenever using any type of data, the best way to start is to plot the data: look at the raw correlation between the two variables and see what it looks like in a graph. Here test scores are our y and the student-teacher ratio is our x. Usually, y is the dependent variable, which is a function of many x, including the number of students per teacher.

What do we learn from this graph?
There is a lot of dispersion: huge variation in test scores even for districts with the same class size.
Test scores are probably related to many other factors besides class size.
There is a visible relationship between class size and test scores; test scores are determined by many factors, but here we only look at two variables, so we can't understand a lot.

How can we better summarize this relationship (and check whether there is one at all)?
A simple way to summarize the relationship between the two variables is to fit a line, the regression line. We want to draw a line through all these points that fits "as closely as possible": the line should minimize the distance between itself and each of the dots. Think of the relationship between the dependent variable and x as a linear function, which can be written as \( y_i = \alpha + \beta x_i + e_i \). Alpha is the intercept (the value of test scores that the line predicts when x = 0). Beta is the slope of the line, the coefficient multiplying x: it tells us the difference we would expect to see in test scores for each unit difference in the student-teacher ratio.
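Fitting the line "as closely as possible" is done in practice by ordinary least squares, which minimizes the sum of squared vertical distances. The sketch below computes the intercept and slope by hand; the data are invented for illustration and are not the California sample.

```python
def ols_line(x, y):
    """Return (alpha, beta) of the least-squares line y = alpha + beta * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: sum of cross-deviations over sum of squared deviations of x.
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    beta = cov_xy / var_x
    # Intercept: the line passes through the point of means.
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Hypothetical student-teacher ratios and test scores (illustration only).
ratios = [15.0, 18.0, 20.0, 22.0, 25.0]
scores = [670.0, 660.0, 655.0, 650.0, 640.0]
alpha, beta = ols_line(ratios, scores)
print(alpha, beta)  # beta is negative: larger classes, lower predicted scores
```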
e.g. population density or selection of students according to house prices.

How the results look in Stata: "Constant" is the intercept alpha; "Str" is the slope beta.

When we run a regression, we estimate alpha and beta, and this means (given that, by a property of OLS, the residuals average zero) that we can construct a predicted value of y for each value of x: knowing alpha and beta, we obtain a predicted value of y (= the outcome, in this case test scores). Suppose we want the predicted value of y for classes with a ratio of 20. We substitute x with 20 and compute: y = alpha - 2.28 × 20 = 653. We can use alpha and beta to predict values of the outcome y.

Summary
Scatterplots are a good way to display bivariate relationships:
In Stata, our results look like this:

The estimate of the standard error is a measure of the precision of our estimates. Remember, point estimates and standard errors are the building blocks for computing the t-statistic, and hence for hypothesis testing. Keeping the standard error of 0.480 in mind, let's test the significance of the slope of the line. We are usually interested in testing whether the slope is different from 0 or not. Hypothesis testing starts with the null hypothesis, which in this case is beta = 0; the alternative hypothesis is beta different from 0. There are 3 ways to perform hypothesis testing:
What if we have a relationship like the one between test scores and district income? The linear regression does not fit too well. There is some curvature in the relationship that the linear regression does not capture: most points are below the OLS line when income is very low or very high, but above the line when income is between 15 and 30K. The relationship between district income and test scores is not a straight line; it is non-linear.

A non-linear function is a function with a slope that is not constant: f(x) is linear if the slope is the same for all values of x, but if the slope depends on the value of x then f(x) is non-linear. If a line is not an accurate description of this relationship, what is? One way to capture that concavity is to model the relationship as a quadratic function: test scores as a function of income and the square of income, \( TestScore_i = \alpha + \beta_1 Income_i + \beta_2 Income_i^2 + e_i \). So we augment our linear regression by including a quadratic term. This fits the data better! This is the quadratic regression model.

Next question: so far we have talked about regression as linear, but what is linear about linear regression? The linearity of the OLS regression function is in the parameters (not necessarily in the data). Consider the Cobb-Douglas production function, \( Y_i = \alpha K_i^{\beta} L_i^{\gamma} \), which describes the output of firm i as a function of alpha (total factor productivity), the capital input K and the labour input L. Say we are interested in estimating alpha, beta and gamma. We can't use OLS on this formulation, but we can take a log transformation of the function, \( \ln Y_i = \delta + \beta \ln K_i + \gamma \ln L_i \) with \( \delta = \ln \alpha \), and then run a regression. The function is not linear in the original data, but we can transform it into something that is linear in its parameters (beta and gamma in this case; not alpha, but we can estimate delta and then recover alpha).

How should we interpret specifications that are non-linear in the original data?
Like the one above, where income enters both linearly and as a square. (1,000 because income is reported in thousands of dollars, so a 1-unit increase in income is actually a 1,000-dollar increase.) In the quadratic specification, it is trickier. How do we interpret the association between test scores and average income? With derivatives: treat the regression function as a function and take the derivative of test scores with respect to income. This gives us the association between test scores and income.
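To make the derivative concrete: for a quadratic specification of the form alpha + beta1 × income + beta2 × income², the derivative with respect to income is beta1 + 2 × beta2 × income. The sketch below evaluates it at a few income levels. The coefficient values are hypothetical, chosen only so that the curve is concave; they are not the estimates from the lecture.

```python
def quadratic_marginal_effect(beta1, beta2, income):
    """Derivative of alpha + beta1*income + beta2*income**2 w.r.t. income."""
    return beta1 + 2 * beta2 * income


# Hypothetical coefficients (illustration only); income in thousands of dollars.
beta1, beta2 = 4.0, -0.04
for income in (10, 20, 40):
    print(income, quadratic_marginal_effect(beta1, beta2, income))
```

With a negative coefficient on the squared term, the association between income and test scores gets weaker as income rises.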
This association is no longer just a number but itself a function of income. This means we can plug in different numbers: e.g., if we are interested in the association at the median of income, we just plug that number in.

More generally, we use the log transformation more than the quadratic specification. Why? The log specification (logs on both sides of the regression) allows us to retrieve proportional effects, and proportional changes in income are often more reasonable than additive changes. In terms of economic impact, a $1,000 change is pretty big for a district with an average income of $15,000; a $1,000 change may be less meaningful for a district with an average income of $40,000. Comparing percentage changes may be a more similar exercise: it is more meaningful to think in terms of proportional change. E.g., what should we expect when income is 10% larger? 1% larger? The log specification allows us to answer. There are 3 different cases in which logarithms might be used:
We will often want to include more than one dummy variable. Each coefficient is interpreted in the same way: the incremental difference in the outcome between the group with the dummy switched on and the group with it switched off. We are allowing our regression to capture a different average value of y for each group.

Interaction between two or more variables
Many of the descriptions we would like to see involve the interaction between two or more variables. Is the relationship between class size and test scores the same for:
When we take this model to the data, we can apply standard OLS (i.e. minimize the sum of squared residuals) and obtain the OLS estimates of our coefficients alpha and beta. We are more interested in beta, the slope of our regression line, which can be written as the ratio between the covariance of y and x and the variance of x: \( \hat{\beta} = \mathrm{Cov}(x, y) / \mathrm{Var}(x) \).

Up to now this is statistics: we are given two variables and we try to draw a line through the cloud of points. The next step is understanding what information the beta coefficient gives us. We might be tempted to give it a causal interpretation: if I increase x by a given quantity, that will on average generate a change in y. Alternatively, we can say that a change in x is associated with a change in y (there is an association between the two variables, described by beta; this is more accurate than saying "causes"). The question is: "does the OLS-estimated coefficient beta hat capture the true causal effect of x on y?" In most cases, no!

Example: education and earnings
Imagine we have (fictitious) data for a set of people on number of years of education and on daily wages in euros. The causal effect of education on earnings is an important question for economics, social sciences and policy. Schooling is one of the largest areas in which policy makers can intervene and invest, so it is important to understand whether increasing access to, or the quality of, education can actually increase the earning ability of individuals, hence their earnings and the amount they are able to contribute. We want to understand whether an increase in education causes an increase in earnings (i.e. estimate the effect of additional years of education on earnings). We start by looking at the correlation between the two variables.
The scatterplot shows a positive association between the two: as years of education increase, we tend to observe higher daily wages. Can we conclude anything about causality between education and earnings? No. But why? There are various reasons why we need to be careful in interpreting beta. What we see could be in part a causal effect and in part a mere association. There are at least two possible explanations of this relationship:
An individual with 0 years of education would earn 61 euros a day, but in practice this has no meaning in this context because in this data there is no one with 0 years of education (so it is not particularly interesting to us).

One very plausible omitted factor here is the ability of individuals: in particular, a specific type of ability which leads individuals to be successful both in schooling and in the labour market. When comparing people with high vs low education in the graph above, we see that this on average corresponds to an earnings differential, but we are not considering the fact that individuals who have "high earnings" might also have high earnings ability irrespective of their education level (they are smarter and their skills are priced at a higher level in the labour market). So, when we compare people with high vs low education, we are also implicitly comparing people with high vs low earnings ability: it could be that part or all of this association is explained by the fact that they are smarter rather than more educated!

Imagine that in the data we observe a distribution of education that is a combination of the blue and red distributions here. Imagine then that we have an indicator for whether an individual is smart or not (although this would be difficult to obtain, because intelligence is multidimensional and difficult to measure): if you split the sample between individuals who are smart and not smart, you see a strong correlation between being smart and having a high education level. What this implies in practice is that, in the linear fit shown previously, we fitted a line through education ignoring the fact that we actually have two different groups (smart and not smart).
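The concern described above can be illustrated with a tiny made-up dataset, in which each year of education adds 2 euros a day for everyone and "smart" individuals earn 20 euros more at any education level. This is only a sketch under those invented assumptions. Demeaning both variables within each group gives the same slope as a regression that includes a dummy for the group.

```python
def slope(x, y):
    """OLS slope of y on x: covariance of x and y over variance of x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    return cov / var


def demean_within(values, groups):
    """Subtract from each value the mean of its own group."""
    totals, counts = {}, {}
    for v, g in zip(values, groups):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]


# Invented data: smart individuals have more education AND higher wages.
education = [8, 10, 12, 12, 14, 16]
smart = [0, 0, 0, 1, 1, 1]
wage = [40 + 20 * s + 2 * e for e, s in zip(education, smart)]

pooled = slope(education, wage)  # ignores the two groups
with_dummy = slope(demean_within(education, smart), demean_within(wage, smart))
print(pooled, with_dummy)  # 5.0 2.0
```

The pooled slope (5) overstates the slope built into the data (2) because it also picks up the wage gap between the two groups.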
When we modify our regression to take into account the fact that there are two groups, the fitted line is very different: the beta coefficient (slope) is much lower, because we are now accounting for the fact that part of that positive association was not due to education at all but was explained by earnings ability.

How do we do this in practice? How can we separately capture the two dimensions in our regression? Dummy variables come into play. Assume that one extra year of education is associated with the same increase in earnings for both low and high ability individuals (a simplifying assumption): the relationship between education and earnings is the same irrespective of whether you are smart or not, so the two lines are parallel (same slope). We impose the same slope for smart and not smart, but we allow for a different intercept, because we are now running a regression that includes our indicator for being smart. There are two clouds of points, both with a positive association, but at different levels of earnings depending on being smart or not. We include one extra variable, a dummy for being smart or not. The dummy allows our model to have two different intercepts (one for the smart and one for the not smart). Results are to be interpreted differently: