

















Data Analysis Estimating Causal Models, Lecture Slide - Engineering, Advanced Data Analysis, Estimating Causal Models, [Causal Effects, Interventions and Experiments], de-confounds, causal sufficiency, front-door criterion, The Back-Door Criterion, Instrumental Variables, Matching and Propensity Scores


















1 Causal Effects, Interventions and Experiments
  1.1 The Special Role of Experiment
2 Identification and Confounding
3 Identification Strategies
  3.1 The Back-Door Criterion: Identification by Conditioning
  3.2 The Front-Door Criterion: Identification by Mechanisms
    3.2.1 The Front-Door Criterion and Mechanistic Explanation
  3.3 Instrumental Variables
    3.3.1 Critique of Instrumental Variables
  3.4 Failures of Identification
4 Matching and Propensity Scores
5 Summary
  5.1 Further Reading
6 Exercises
There are two problems which are both known as “causal inference”:
The first problem is easier, so we’ll begin with it.
Probabilistic conditioning            Causal conditioning
------------------------------------  -------------------------------------
Pr(Y | X = x)                         Pr(Y | do(X = x))
Factual                               Counter-factual
Select a sub-population               Generate a new population
Predicts passive observation          Predicts active manipulation
Calculate from full DAG               Calculate from surgically-altered DAG
Always identifiable when X and Y      Not always identifiable even when
are observable                        X and Y are observable

Table 1: Contrasts between ordinary probabilistic conditioning and causal conditioning. (See below on identifiability.)
As a reminder, when I talk about the causal effect of X on Y , which I write
Pr (Y |do(X = x)) (1)
I mean the distribution of Y which would be generated, counterfactually, were X to be set to the particular value x. This is not, in general, the same as the ordinary conditional distribution
Pr (Y |X = x) (2)
The reason is that the latter represents taking the original population, as it is, and just filtering it to get the sub-population where X = x. The processes which set X to that value may also have influenced Y through other channels, and so this distribution will not, typically, really tell us what would happen if we reached in and manipulated X. We can sum up the contrast in a little table (Table 1). As we saw two lectures ago, if we have the full graph for a directed acyclic graphical model, it tells us how to calculate the joint distribution of all the variables, from which of course the conditional distribution of any one variable given another follows. As we saw in the last lecture, calculations of Pr (Y |do(X = x)) use a “surgically altered” graph, in which all arrows into X are deleted, and its value is pinned at x, but the rest of the graph is as before. If we know the DAG, and we know the distribution of each variable given its parents, we can calculate any causal effect we want, by graph-surgery.
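To make the contrast concrete, here is a minimal sketch of graph surgery on a three-node DAG, U → X, U → Y, X → Y, with binary variables. The graph and every number in the probability tables are made up for illustration; they are not taken from the lecture. Ordinary conditioning filters the joint distribution, while causal conditioning deletes the arrow into X and pins its value.

```python
from itertools import product

# Hypothetical conditional probability tables for the DAG U -> X, U -> Y, X -> Y.
P_U = {0: 0.5, 1: 0.5}                        # Pr(U = u)
P_X1_GIVEN_U = {0: 0.2, 1: 0.8}               # Pr(X = 1 | U = u)
P_Y1_GIVEN_XU = {(0, 0): 0.1, (1, 0): 0.3,    # Pr(Y = 1 | X = x, U = u), keyed by (x, u)
                 (0, 1): 0.6, (1, 1): 0.8}


def bern(p_one, value):
    """Probability that a binary variable with Pr(1) = p_one equals value."""
    return p_one if value == 1 else 1.0 - p_one


def joint(u, x, y):
    """Joint probability from the full DAG: each node given its parents."""
    return P_U[u] * bern(P_X1_GIVEN_U[u], x) * bern(P_Y1_GIVEN_XU[(x, u)], y)


def pr_y1_given_x(x):
    """Ordinary conditioning: filter the original population to X = x."""
    numerator = sum(joint(u, x, 1) for u in (0, 1))
    denominator = sum(joint(u, x, y) for u, y in product((0, 1), repeat=2))
    return numerator / denominator


def pr_y1_do_x(x):
    """Causal conditioning: delete the arrow U -> X, pin X = x, keep the rest."""
    return sum(P_U[u] * P_Y1_GIVEN_XU[(x, u)] for u in (0, 1))


for x in (0, 1):
    print(f"x={x}  Pr(Y=1|X=x)={pr_y1_given_x(x):.3f}  Pr(Y=1|do(X=x))={pr_y1_do_x(x):.3f}")
```

In this toy model the two quantities differ for both values of x, because U pushes X and Y in the same direction.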
there are many experiments we shouldn’t do, and because there are many experiments which would just be too hard to organize. We must therefore consider how to do causal inference from non-experimental, observational data.
For today’s purposes, the most important distinction between probabilistic and causal conditioning has to do with the identification (or identifiability) of the conditional distributions. An aspect of a statistical model is identifiable when it cannot be changed without there also being some change in the distribution of the observable variables. If we can alter part of a model with no observable consequences, that part of the model is unidentifiable^2. Sometimes the lack of identification is trivial: in a two-component mixture model, we get the same observable distribution if we swap the labels of the two component distributions. The rotation problem for factor models is a less trivial identification problem^3. If two variables are co-linear, then their coefficients in a linear regression are unidentifiable^4. Note that identification is about the true distribution, not about what happens with finite data. A parameter might be identifiable, but we could have so little information about it in our data that our estimates are unusable, with immense confidence intervals; that’s unfortunate, but we just need more data. An unidentifiable parameter, however, cannot be estimated even with infinite data.^5

When X and Y are both observable variables, Pr(Y | X = x) can’t help being identifiable. (Changing this just is changing part of the distribution of observables.) Things are very different, however, for Pr(Y | do(X = x)). In some models, it’s entirely possible to change this drastically, and always have the same distribution of observables, by making compensating changes to other parts of the model. When this is the case, we simply cannot estimate causal effects from observational data. The basic problem is illustrated in Figure 1.

In Figure 1, X is a parent of Y. But if we analyze the dependence of Y on X, say in the form of the conditional distribution Pr(Y | X = x), we see that there are two channels by which information flows from cause to effect. One is the direct, causal path, represented by Pr(Y | do(X = x)). The other is the indirect path, where X gives information about its parent U, and U gives information about its child Y. If we just observe X and Y, we cannot separate the causal effect from the indirect inference. The causal effect is confounded with the indirect inference. More generally, the effect of X on Y is confounded whenever Pr(Y | do(X = x)) ≠ Pr(Y | X = x). If there is some way to write Pr(Y | do(X = x)) in terms of distributions of observables, we say that the confounding can be removed by an adjustment, or an identification strategy,
(^2) More strictly, say that the model has two parameters, θ and ψ. The distinction between θ₁ and θ₂ is identifiable if, for all ψ₁, ψ₂, the distribution over observables coming from (θ₁, ψ₁) is different from that coming from (θ₂, ψ₂). If the right choice of ψ₁ and ψ₂ masks the distinction between θ₁ and θ₂, then θ is unidentifiable.
(^3) As this example suggests, what is identifiable depends on what is observed. If we could observe the factors directly, factor loadings would be identifiable.
(^4) As that example suggests, whether one aspect of a model is identifiable or not can depend on other aspects of the model. If the co-linearity was broken, the two regression coefficients would become identifiable.
(^5) For more on identifiability, and what to do with unidentifiable problems, see the great book by Manski (2007).
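The claim that a causal effect can be changed drastically without any change in the observables can be checked on a toy example. The sketch below builds two hypothetical models over a latent U and observed X and Y (all numbers are invented for illustration): in model A, U confounds X and Y; in model B, U does nothing and X has a stronger effect. Both give the same distribution for (X, Y), yet different values of Pr(Y = 1 | do(X = 1)).

```python
from itertools import product


def observable_joint(p_u, p_x1_given_u, p_y1_given_xu):
    """Distribution of the observables (X, Y) after summing out the latent U."""
    dist = {}
    for x, y in product((0, 1), repeat=2):
        total = 0.0
        for u, pu in p_u.items():
            px = p_x1_given_u[u] if x == 1 else 1.0 - p_x1_given_u[u]
            py1 = p_y1_given_xu[(x, u)]
            py = py1 if y == 1 else 1.0 - py1
            total += pu * px * py
        dist[(x, y)] = total
    return dist


def causal_effect(p_u, p_y1_given_xu, x):
    """Pr(Y = 1 | do(X = x)) by graph surgery: cut U -> X and pin X = x."""
    return sum(pu * p_y1_given_xu[(x, u)] for u, pu in p_u.items())


P_U = {0: 0.5, 1: 0.5}

# Model A (hypothetical): U influences both X and Y. Tables keyed by (x, u).
A_X = {0: 0.2, 1: 0.8}
A_Y = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.6, (1, 1): 0.8}

# Model B (hypothetical): U is irrelevant; X alone drives Y, more strongly.
B_X = {0: 0.5, 1: 0.5}
B_Y = {(0, 0): 0.2, (1, 0): 0.7, (0, 1): 0.2, (1, 1): 0.7}

obs_a = observable_joint(P_U, A_X, A_Y)
obs_b = observable_joint(P_U, B_X, B_Y)
effect_a = causal_effect(P_U, A_Y, 1)
effect_b = causal_effect(P_U, B_Y, 1)

print("same observables:", all(abs(obs_a[k] - obs_b[k]) < 1e-12 for k in obs_a))
print("Pr(Y=1|do(X=1)) in A and B:", effect_a, effect_b)
```

No amount of data on X and Y alone could distinguish these two models, which is exactly what it means for the causal effect to be unidentifiable from those observables.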
To recap, we want to calculate the causal effect of X on Y , Pr (Y |do(X = x)), but we cannot do an actual experiment, and must rely on observations. In addition to X and Y , there will generally be some covariates Z which we know, and we’ll assume we know the causal graph, which is a DAG. Is this enough to determine Pr (Y |do(X = x))? That is, does the joint distribution identify the causal effect? The answer is “yes” when the covariates Z contain all the other relevant variables^6. The inferential problem is then no worse than any other statistical estimation problem. In fact, if we know the causal graph and get to observe all the variables, then we could (in principle) just use our favorite non-parametric conditional density estimate at each node in the graph, with its parent variables as the inputs and its own variable as the response. Multiplying conditional distributions together gives the whole distribution of the graph, and we can get any causal effects we want by surgery. Equivalently (Exercise), we have that
$$\Pr(Y \mid do(X = x)) = \sum_{t} \Pr(Y \mid X = x, Pa(X) = t) \Pr(Pa(X) = t) \tag{3}$$
where Pa(X) is the complete set of parents of X. If we’re willing to assume more, we can get away with just using non-parametric regression or even just an additive model at each node. Assuming yet more, we could use parametric models at each node; the linear-Gaussian assumption is (alas) very popular.

If some variables are not observed, then the issue of which causal effects are observationally identifiable is considerably trickier. Apparently subtle changes in which variables are available to us and used can have profound consequences. The basic principle underlying all considerations is that we would like to condition on adequate control variables, which will block paths linking X and Y other than those which would exist in the surgically-altered graph where all paths into X have been removed. If other unblocked paths exist, then there is some confounding of the causal effect of X on Y with their mutual dependence on other variables.

This is familiar to us from regression as the basic idea behind using additional variables in our regression, where the idea is that by introducing covariates, we “control for” other effects, until the regression coefficient for our favorite variable represents only its causal effect. Leaving aside the inadequacies of linear regression as such (that’s what we spent the first third of the class on),
(^6) This condition is sometimes known as causal sufficiency. Strictly speaking, we do not have to suppose that all causes are included in the model and observable. What we have to assume is that all of the remaining causes have such an unsystematic relationship to the ones included in the DAG that they can be modeled as noise. (This does not mean that the noise is necessarily small.) In fact, what we really have to assume is that the relationships between the causes omitted from the DAG and those included is so intricate and convoluted that it might as well be noise, along the lines of algorithmic information theory (Li and Vitányi, 1997), whose key result might be summed up as “Any determinism distinguishable from randomness is insufficiently complex”. But here we verge on philosophy.
Figure 2: “Controlling for” additional variables can introduce bias into estimates of causal effects. Here the effect of X on Y is directly identifiable, Pr(Y | do(X = x)) = Pr(Y | X = x). If we also condition on Z however, because it is a common effect of X and Y, we’d get Pr(Y | X = x, Z = z) ≠ Pr(Y | X = x). In fact, even if there were no arrow from X to Y, conditioning on Z would make Y depend on X.
we need to be cautious here. Just conditioning on everything possible does not give us adequate control, or even necessarily bring us closer to it. As Figure 2 illustrates, and as Homework 11 will drive home, adding an ill-chosen covariate to a regression can create confounding. There are three main ways we can find adequate controls, and so get both identifiability and appropriate adjustments: the back-door criterion (identification by conditioning), the front-door criterion (identification by mechanisms), and instrumental variables.
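The warning about ill-chosen covariates can be checked numerically. This is a minimal sketch of the last case described in the caption of Figure 2: X and Y are independent fair coins with no arrow between them, and Z is a common effect of both. The table for Z is hypothetical.

```python
# Hypothetical collider: X -> Z <- Y, with X and Y independent fair coins.
P_Z1_GIVEN_XY = {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.9}  # keyed by (x, y)


def joint(x, y, z):
    """Joint probability of (X, Y, Z) under the collider model."""
    pz1 = P_Z1_GIVEN_XY[(x, y)]
    return 0.5 * 0.5 * (pz1 if z == 1 else 1.0 - pz1)


def pr_y1(x, z=None):
    """Pr(Y = 1 | X = x), or Pr(Y = 1 | X = x, Z = z) when z is given."""
    z_values = (0, 1) if z is None else (z,)
    numerator = sum(joint(x, 1, zz) for zz in z_values)
    denominator = sum(joint(x, y, zz) for y in (0, 1) for zz in z_values)
    return numerator / denominator


# Without conditioning on Z, X tells us nothing about Y.
print(pr_y1(0), pr_y1(1))
# After conditioning on the common effect Z, Y depends on X.
print(pr_y1(0, z=1), pr_y1(1, z=1))
```

Conditioning on Z = 1 makes Y depend on X even though neither causes the other, so "controlling for" Z would manufacture an association rather than remove one.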
from Eq. 3 that
$$\Pr(Y \mid do(X = x)) = \sum_{t} \Pr(Pa(X) = t) \Pr(Y \mid X = x, Pa(X) = t) \tag{5}$$

Now, we can always introduce another set of conditioned variables, if we sum out over them:

$$\Pr(Y \mid do(X = x)) = \sum_{t} \Pr(Pa(X) = t) \sum_{s} \Pr(Y, S = s \mid X = x, Pa(X) = t) \tag{6}$$

We can do this for any set of variables S, it’s just probability. It’s also just probability that

$$\Pr(Y, S = s \mid X = x, Pa(X) = t) = \Pr(Y \mid X = x, Pa(X) = t, S = s) \Pr(S = s \mid X = x, Pa(X) = t) \tag{7}$$

so

$$\Pr(Y \mid do(X = x)) = \sum_{t} \Pr(Pa(X) = t) \sum_{s} \Pr(Y \mid X = x, Pa(X) = t, S = s) \Pr(S = s \mid X = x, Pa(X) = t) \tag{8}$$

Now we invoke the fact that S satisfies the back-door criterion. Point (i) of the criterion, blocking back-door paths, implies that Y ⊥ Pa(X) | X, S. Thus

$$\Pr(Y \mid do(X = x)) = \sum_{t} \Pr(Pa(X) = t) \sum_{s} \Pr(Y \mid X = x, S = s) \Pr(S = s \mid X = x, Pa(X) = t) \tag{9}$$

Point (ii) of the criterion, not containing descendants of X, means (by the Markov property) that X ⊥ S | Pa(X). Therefore

$$\Pr(Y \mid do(X = x)) = \sum_{t} \Pr(Pa(X) = t) \sum_{s} \Pr(Y \mid X = x, S = s) \Pr(S = s \mid Pa(X) = t) \tag{10}$$

Since $\sum_{t} \Pr(Pa(X) = t) \Pr(S = s \mid Pa(X) = t) = \Pr(S = s)$, we have, at last,

$$\Pr(Y \mid do(X = x)) = \sum_{s} \Pr(Y \mid X = x, S = s) \Pr(S = s) \tag{11}$$

as promised.
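As a sanity check on Eq. 11, the following sketch uses a hypothetical graph U → X, U → S, S → Y, X → Y in which the only parent of X, namely U, is unobserved. The set {S} blocks the back-door path X ← U → S → Y and is not a descendant of X, so the adjustment should recover the true effect using only the joint distribution of (X, S, Y). All probability tables are invented for illustration.

```python
from itertools import product

# Hypothetical model: U -> X, U -> S, S -> Y, X -> Y, with U latent.
P_U = {0: 0.5, 1: 0.5}
P_X1_GIVEN_U = {0: 0.2, 1: 0.8}
P_S1_GIVEN_U = {0: 0.3, 1: 0.9}
P_Y1_GIVEN_XS = {(0, 0): 0.1, (1, 0): 0.4, (0, 1): 0.5, (1, 1): 0.8}  # keyed by (x, s)


def bern(p_one, value):
    """Probability that a binary variable with Pr(1) = p_one equals value."""
    return p_one if value == 1 else 1.0 - p_one


# Observable joint over (X, S, Y): the latent U is summed out.
OBS = {}
for x, s, y in product((0, 1), repeat=3):
    OBS[(x, s, y)] = sum(
        P_U[u] * bern(P_X1_GIVEN_U[u], x) * bern(P_S1_GIVEN_U[u], s)
        * bern(P_Y1_GIVEN_XS[(x, s)], y)
        for u in (0, 1)
    )


def p(x=None, s=None, y=None):
    """Probability of the event fixing any subset of (X, S, Y), from OBS only."""
    return sum(
        pr for (xx, ss, yy), pr in OBS.items()
        if (x is None or xx == x) and (s is None or ss == s) and (y is None or yy == y)
    )


def back_door(x):
    """Eq. 11: sum over s of Pr(Y = 1 | X = x, S = s) Pr(S = s)."""
    return sum(p(x=x, s=s, y=1) / p(x=x, s=s) * p(s=s) for s in (0, 1))


def naive(x):
    """Unadjusted Pr(Y = 1 | X = x)."""
    return p(x=x, y=1) / p(x=x)


def truth(x):
    """Graph surgery on the full model (uses the latent U): cut U -> X, pin X = x."""
    return sum(
        P_U[u] * bern(P_S1_GIVEN_U[u], s) * P_Y1_GIVEN_XS[(x, s)]
        for u in (0, 1) for s in (0, 1)
    )


for x in (0, 1):
    print(x, round(naive(x), 3), round(back_door(x), 3), round(truth(x), 3))
```

The adjusted value matches the surgical one, while the naive conditional does not, even though the adjustment never touches U.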
A set of variables M satisfies the front-door criterion when (i) M blocks all directed paths from X to Y , (ii) there are no unblocked back-door paths from X to M , and (iii) X blocks all back-door paths from M to Y. Then
$$\Pr(Y \mid do(X = x)) = \sum_{m} \Pr(M = m \mid X = x) \sum_{x'} \Pr(Y \mid X = x', M = m) \Pr(X = x') \tag{12}$$
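Eq. 12 can be verified on a small example. The sketch below assumes a hypothetical graph U → X, U → Y, X → M, M → Y with U latent, so that M satisfies all three clauses of the front-door criterion. The probability tables are made up; the front-door formula is computed from the joint distribution of (X, M, Y) alone and compared with graph surgery on the full model.

```python
from itertools import product

# Hypothetical model: U -> X, U -> Y, X -> M, M -> Y, with U latent.
P_U = {0: 0.5, 1: 0.5}
P_X1_GIVEN_U = {0: 0.2, 1: 0.8}
P_M1_GIVEN_X = {0: 0.1, 1: 0.9}
P_Y1_GIVEN_MU = {(0, 0): 0.1, (1, 0): 0.5, (0, 1): 0.4, (1, 1): 0.9}  # keyed by (m, u)


def bern(p_one, value):
    """Probability that a binary variable with Pr(1) = p_one equals value."""
    return p_one if value == 1 else 1.0 - p_one


# Observable joint over (X, M, Y): the latent U is summed out.
OBS = {}
for x, m, y in product((0, 1), repeat=3):
    OBS[(x, m, y)] = sum(
        P_U[u] * bern(P_X1_GIVEN_U[u], x) * bern(P_M1_GIVEN_X[x], m)
        * bern(P_Y1_GIVEN_MU[(m, u)], y)
        for u in (0, 1)
    )


def p(x=None, m=None, y=None):
    """Probability of the event fixing any subset of (X, M, Y), from OBS only."""
    return sum(
        pr for (xx, mm, yy), pr in OBS.items()
        if (x is None or xx == x) and (m is None or mm == m) and (y is None or yy == y)
    )


def front_door(x):
    """Eq. 12, evaluated for Y = 1 using observable quantities only."""
    total = 0.0
    for m in (0, 1):
        p_m_given_x = p(x=x, m=m) / p(x=x)
        # Inner sum: back-door adjustment for the effect of M on Y, controlling for X.
        inner = sum(p(x=xp, m=m, y=1) / p(x=xp, m=m) * p(x=xp) for xp in (0, 1))
        total += p_m_given_x * inner
    return total


def truth(x):
    """Graph surgery on the full model (uses the latent U): cut U -> X, pin X = x."""
    return sum(
        bern(P_M1_GIVEN_X[x], m) * sum(P_U[u] * P_Y1_GIVEN_MU[(m, u)] for u in (0, 1))
        for m in (0, 1)
    )


for x in (0, 1):
    print(x, round(p(x=x, y=1) / p(x=x), 3), round(front_door(x), 3), round(truth(x), 3))
```

Here the naive conditional Pr(Y = 1 | X = x) is confounded by U, but the front-door expression agrees with the surgical answer.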
A natural reaction to the front-door criterion is “Say what?”, but it becomes more comprehensible if we take it apart. Because, by clause (i), M blocks all directed paths from X to Y , any causal dependence of Y on X must be mediated by a dependence of Y on M :
$$\Pr(Y \mid do(X = x)) = \sum_{m} \Pr(Y \mid do(M = m)) \Pr(M = m \mid do(X = x)) \tag{13}$$
Clause (ii) says that we can get the effect of X on M directly,
Pr (M = m|do(X = x)) = Pr (M = m|X = x). (14)
Clause (iii) says that X satisfies the back-door criterion for identifying the effect of M on Y , and the inner sum in Eq. 12 is just the back-door computation (Eq.
3.2.1 The Front-Door Criterion and Mechanistic Explanation
Morgan and Winship (2007, ch. 8) give a useful insight into the front-door criterion. Each directed path from X to Y is, or can be thought of as, a separate mechanism by which X influences Y. The requirement that all such paths be blocked by M, clause (i), is the requirement that the set of mechanisms included in M be “exhaustive”. The two back-door conditions, (ii) and (iii), require that the mechanisms be “isolated”, not interfered with by the rest of the data-generating process (at least once we condition on X). Once we identify an isolated and exhaustive set of mechanisms, we know all the ways in which X actually affects Y, and any indirect paths can be discounted, using the front-door adjustment (Eq. 12). One interesting possibility suggested by this is to elaborate mechanisms into sub-mechanisms, which could be used in some cases where the plain front-door criterion won’t apply^8 (see Figure 5). Because U is a parent of M, we cannot use the front-door criterion to identify the effect of X on Y. (Clause (i) holds, but (ii) and (iii) both fail.) But we can use M₁ and the front-door criterion to find
(^8) The ideas in this paragraph come from Prof. Winship, who I understand is currently (April 2011) preparing a paper on this.
Figure 5: The path X → M → Y contains all the mechanisms by which X influences Y, but is not isolated from the rest of the system (U → M). The sub-mechanisms X → M₁ → M and M → M₂ → Y are isolated, and the original causal effect can be identified by composing them.
Figure 6: A valid instrumental variable, I, is related to the cause of interest, X, and influences Y only through its influence on X, at least once control variables block other paths. Here, to use I as an instrument, we should condition on S, but should not condition on B. (If we could condition on U , we would not need to use an instrument.)
A variable I is an instrument^9 for identifying the effect of X on Y when there is a set of controls S such that (i) I ⊥̸ X | S, and (ii) every unblocked path from I to Y has an arrow pointing into X. Another way to say (ii) is that I ⊥ Y | S, do(X). Colloquially, I influences Y, but they are only dependent through I first influencing X. (See Figure 6.) How is this useful? By making back-door adjustments for S, we can identify Pr(Y | do(I = i)) and Pr(X | do(I = i)). Since all the causal influence of I on Y must be channeled through X (by point (ii)), we have
$$\Pr(Y \mid do(I = i)) = \sum_{x} \Pr(Y \mid do(X = x)) \Pr(X = x \mid do(I = i)) \tag{15}$$
as in Eq. 3. We can thus identify the causal effect of X on Y whenever Eq. 15 can be solved for Pr(Y | do(X = x)) in terms of Pr(Y | do(I = i)) and Pr(X | do(I = i)). Unfortunately, it is not possible to find a unique solution in general. In the very special case where the dependence of X on I and of Y on X are both linear,

(^9) The term “instrumental variables” comes from econometrics, where they were originally used, in the 1940s, to identify parameters in simultaneous equation models. (The metaphor was that I is a measuring instrument for the otherwise inaccessible parameters.) Definitions of instrumental variables are surprisingly murky and controversial outside of extremely simple linear systems; this one is taken from Galles and Pearl (1997), via Pearl (2009b, §7.4.5).
using our scientific knowledge, or rather theories, to find isolated and exhaustive mechanisms, finding valid instruments would seem to rest a lot on theories about the world (or the part of it under study), and one would want to try to check those theories. In fact, instrumental variable estimates of causal effects are often presented as more or less unquestionable, and free of theoretical assumptions; economists, and other social scientists influenced by them, are especially apt to do this. As the economist Daniel Davies puts it^11, devotees of this approach
    have a really bad habit of saying:
    “Whichever way you look at the numbers, X”.
    when all they can really justify is:
    “Whichever way I look at the numbers, X”.
    but in fact, I should have said that they could only really support:
    “Whichever way I look at these numbers, X”.
(Emphasis in the original.) It will not surprise you to learn that I think this is very wrong.

I hope that, after four months of nonlinear models, if someone tries to sell you a linear regression, you will be very skeptical, but let’s leave that to one side. (It’s not impossible that everything really is linear.) The clue that instrumental variable estimation is a creature of theoretical assumptions is point (ii) in the definition of an instrument: I ⊥ Y | S, do(X). This says that if we eliminate all the arrows into X, the control variables S block all the other paths between I and Y. This is exactly as much an assertion about mechanisms as what we have to do with the front-door criterion. In fact it doesn’t just say that every mechanism by which I influences Y is mediated by X, it also says that there are no common causes of I and Y (other than those blocked by S).

This assumption is most easily defended when I is genuinely random. For instance, if we do a randomized experiment, I might be a coin-toss which assigns each subject to be in either the treatment or control group, each with a different value of X. If “compliance” is not perfect (if some of those in the treatment group don’t actually get the treatment, or some in the control group do), it is nonetheless plausible that the only route by which I influences the outcome is through X, so an instrumental variable regression is appropriate. (I here is sometimes called “intent to treat”.)

Even here, we must be careful. If we are evaluating a new medicine, whether people think they are getting a medicine or not could change how they act, and medical outcomes. Knowing whether they were assigned to the treatment or the control group would thus create another path from I to Y, not going through X. This is why randomized clinical trials are generally “double-blinded” (neither patients nor medical personnel know who is in the control group); but how effective the double-blinding is, is itself a theoretical assumption.
More generally, any argument that a candidate instrument is valid is really an argument that other channels of influence, apart from the favored one through X, can be ruled out. This generally cannot be done through analyzing the same variables used in the instrumental-variable estimation (see below), but involves some theory about the world, and rests on the strength of the evidence for that theory. As has been pointed out multiple times (for instance, by Rosenzweig and Wolpin (2000)), the theories needed to support instrumental variable estimates in particular concrete cases are often not very well-supported, and plausible rival theories can produce very different conclusions from the same data.

Many people have thought that one can test for the validity of an instrument, by looking at whether I ⊥ Y | X, the idea being that, if influence flows from I through X to Y, conditioning on X should block the channel. The problem is that, in the instrumental-variable set-up, X is a collider, so conditioning on X actually creates an indirect dependence even if I is valid. So I ⊥̸ Y | X, whether or not the instrument is valid, and the test (even if performed perfectly with infinite data) tells us nothing^12.

A final, more or less technical, issue with instrumental variable estimation is that many instruments are (even if valid) weak: they only have a little influence on X, and a small covariance with it. This means that the denominator in Eq. 22 is a number close to zero. Error in estimating the denominator, then, results in a much larger error in estimating the ratio. Weak instruments lead to noisy and imprecise estimates of causal effects. It is not hard to construct scenarios where, at reasonable sample sizes, one is actually better off using the biased OLS estimate than the unbiased but high-variance instrumental estimate.

(^11) In part four of his epic and insightful review of Freakonomics; see http://d-squareddigest.blogspot.com/2007/09/freakiology-yes-folks-its-part-4-of.html.
(^12) However, see Pearl (2009b, §8.4) for a different approach which can “screen out very bad would-be instruments”.
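The weak-instrument problem is easy to see in simulation. Eq. 22 does not appear in this excerpt, so the sketch below assumes the usual linear-case ratio estimator, Cov(I, Y) / Cov(I, X); that formula, the linear model, and every coefficient are assumptions made for illustration, not something taken from the text. A latent U confounds X and Y, the true coefficient of X is 1, and the `strength` argument controls how strongly the instrument moves X.

```python
import random
import statistics


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)


def simulate(n, strength, rng, beta=1.0):
    """Draw n cases from a hypothetical linear model with latent confounder U."""
    inst, xs, ys = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                      # unobserved common cause of X and Y
        i = rng.gauss(0, 1)                      # instrument, independent of U
        x = strength * i + u + rng.gauss(0, 1)
        y = beta * x + 2.0 * u + rng.gauss(0, 1)
        inst.append(i)
        xs.append(x)
        ys.append(y)
    return inst, xs, ys


def iv_estimate(inst, xs, ys):
    """Assumed ratio estimator for the linear case: Cov(I, Y) / Cov(I, X)."""
    return cov(inst, ys) / cov(inst, xs)


def ols_estimate(xs, ys):
    """Slope of the ordinary least-squares regression of Y on X."""
    return cov(xs, ys) / cov(xs, xs)


def spread(strength, n=200, reps=200, seed=0):
    """Median absolute deviation of the IV estimate across repeated samples."""
    rng = random.Random(seed)
    estimates = [iv_estimate(*simulate(n, strength, rng)) for _ in range(reps)]
    centre = statistics.median(estimates)
    return statistics.median([abs(e - centre) for e in estimates])


# One large sample with a strong instrument: IV is near the truth, OLS is biased upward.
inst, xs, ys = simulate(20000, 1.0, random.Random(1))
iv_strong = iv_estimate(inst, xs, ys)
ols = ols_estimate(xs, ys)

# Repeated small samples: compare the variability of IV for strong and weak instruments.
mad_strong = spread(1.0)
mad_weak = spread(0.05)

print("IV (strong instrument):", iv_strong, " OLS:", ols)
print("spread of IV estimates, strong vs weak:", mad_strong, mad_weak)
```

The median absolute deviation is used instead of the standard deviation because a ratio with a near-zero denominator has very heavy tails, and a few wild estimates would otherwise dominate the comparison.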
nodes in Figure 7). These include all the traits of either person which (a) influence who they become friends with, and (b) influence whether or not they become obese. A very partial list of these would include: taste for recreational exercise, opportunity for recreational exercise, taste for alcohol, ability to consume alcohol, tastes in food, occupation and how physically demanding it is, ethnic background^14, etc. Put simply, if Irene and Joey are friends because they spend two hours in the same bar every day drinking and eating wings, it’s less surprising that both of them have an elevated chance of becoming obese, and likewise if they became friends because they both belong to the decathlete’s club, they are both unusually unlikely to become obese. Irene’s status is predictable from Joey’s, then, not (or not just) because Joey influences Irene, but because seeing what kind of person Irene’s friends are tells us about what kind of person Irene is. It is not too hard to convince oneself that there is just no way, in this DAG, to get at the causal effect of Joey’s behavior on Irene’s that isn’t confounded with their latent traits (Shalizi and Thomas, 2011). To de-confound, we would need to actually measure those latent traits, which may not be impossible but certainly was not done here^15.

When identification is not possible (when we can’t de-confound), it may still be possible to bound causal effects. That is, even if we can’t say exactly what Pr(Y | do(X = x)) must be, we can still say it has to fall within a certain (non-trivial!) range of possibilities. The development of bounds for non-identifiable quantities, what’s sometimes called partial identification, is an active area of research, which I think is very likely to become more and more important in data analysis; the best introduction I know is Manski (2007).
(^14) Friendships often run within ethnic communities. On the one hand, this means that friends tend to be more genetically similar than random members of the same town, so they will be unusually apt to share genes which influence susceptibility to obesity (in that environment). On the other hand, ethnic communities transmit, non-genetically, traditions regarding food, alcohol, sports, exercise, etc., and (again non-genetically) influence employment opportunities.
(^15) Of course, the issue is not really about obesity. Studies of “viral marketing”, and of social influence more broadly, all generically have the same problem. Predicting someone’s behavior from that of their friend means conditioning on the existence of a social tie between them, but that social tie is a collider, and activating the collider creates confounding.
Suppose that our causal variable of interest X is binary, or (almost equivalently) that we are only interested in comparing the effect of two levels, do(X = x₁) and do(X = x₂). Let’s call these the “treatment” and “control” groups for definiteness, though nothing really hinges on one of them being in any sense a normal or default value (as “control” suggests).

A common estimation strategy, especially when what we are interested in is the difference between treated cases and controls, is matching. For each treated case, we try to find a control case which has similar values for the covariates S^16. Taking the difference in Y between the treated case and its matched control then gives an indication of how much effect X has on the expected value of Y.

If the number of covariates is large, a sort of curse of dimensionality sets in, and it can become extremely hard to find matches. A very clever idea, due to Rosenbaum and Rubin (1983), reduces the number of covariates we have to match on to one dimension. This is what is called the propensity score, the probability of being in the treated group as a function of the covariates:
$$\rho(s) = \Pr(X = \text{treatment} \mid S = s) \tag{24}$$

The trick is that this propensity score is a sufficient statistic for predicting treatment status, in the sense that knowing the full covariates tells us no more than just knowing ρ:

$$X \perp S \mid \rho \tag{25}$$
Consequently, when we take a treated case, we can match it to a control case with the same propensity score, one which was just as likely to receive treatment, but, as it happens, did not. If we are interested not just in the difference in expected Y’s, E[Y | do(X = x₁)] − E[Y | do(X = x₂)], but the full causal effects, Pr(Y | do(X = x₁)) and Pr(Y | do(X = x₂)), then we do not want to do matching. But if we could use S to do back-door adjustment, we can also use ρ, and doing so is apt to be computationally simpler, and perhaps more stable.

There are two crucial issues to bear in mind while using propensity scores: score computation, and causal adequacy. Except in extremely unusual circumstances, we do not have an analytical formula for ρ(s). This means that it must be modeled and estimated. The most common model seems to be logistic regression, but so far as I can see this is just for computational convenience. Since accurate propensity scores are needed to make the method work, it would seem to be worthwhile to model ρ very carefully.

The more important, and neglected, issue is that calculating a propensity score doesn’t put any new information into S, it just summarizes what it has to say about X. If S was an adequate control to prevent confounding, then so
(^16) If no exact match is available, we might match to within some distance, or do some sort of kernel-weighted matching. See, e.g., Morgan and Winship (2007) for details.
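The claim that ρ can stand in for S in a back-door adjustment can be checked exactly on a toy model. In the sketch below, S takes four values, two pairs of which share a propensity score, so stratifying on ρ is strictly coarser than stratifying on S. The model (S → X, S → Y, X → Y) and all of its numbers are hypothetical, and ρ is treated as known rather than estimated, which sidesteps the score-computation issue raised above.

```python
from itertools import product

# Hypothetical model: S -> X, S -> Y, X -> Y, with S an adequate control.
P_S = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}       # Pr(S = s)
RHO = {0: 0.2, 1: 0.2, 2: 0.7, 3: 0.7}       # propensity score Pr(X = 1 | S = s)


def p_y1(x, s):
    """Pr(Y = 1 | X = x, S = s); a made-up response surface."""
    return 0.1 + 0.3 * x + 0.15 * s


# Joint distribution over (S, X, Y).
JOINT = {}
for s, x, y in product(P_S, (0, 1), (0, 1)):
    px = RHO[s] if x == 1 else 1.0 - RHO[s]
    py = p_y1(x, s) if y == 1 else 1.0 - p_y1(x, s)
    JOINT[(s, x, y)] = P_S[s] * px * py


def adjusted(x, stratum_of):
    """Back-door adjustment for Pr(Y = 1 | do(X = x)), stratifying on stratum_of(s)."""
    strata = {}
    for (s, xx, y), pr in JOINT.items():
        cell = strata.setdefault(stratum_of(s), {"all": 0.0, "x": 0.0, "xy": 0.0})
        cell["all"] += pr
        if xx == x:
            cell["x"] += pr
            if y == 1:
                cell["xy"] += pr
    # Sum over strata of Pr(Y = 1 | X = x, stratum) * Pr(stratum).
    return sum(c["xy"] / c["x"] * c["all"] for c in strata.values())


by_s = [adjusted(x, lambda s: s) for x in (0, 1)]           # adjust on S itself
by_rho = [adjusted(x, lambda s: RHO[s]) for x in (0, 1)]    # adjust on the score only
naive = [adjusted(x, lambda s: "all") for x in (0, 1)]      # one stratum: no adjustment

print("adjusting on S:  ", by_s)
print("adjusting on rho:", by_rho)
print("no adjustment:   ", naive)
```

Adjusting on the two-valued score gives the same answer as adjusting on the four-valued covariate, while the unadjusted conditional is off. If S were not an adequate control in the first place, the score would inherit the same failure.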