
















Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
The concept of factor analysis, a statistical method used to identify underlying factors that explain the covariance structure of a set of variables. The graphical model representation, observables correlation through factors, roots of factor analysis, estimation, and factor scores estimation. Spearman's one-factor model and the thomson sampling model are also discussed.
Typology: Slides
1 / 24
This page cannot be seen from the preview
Don't miss anything!

















Computational Exercise: Step through the examples in the lecture-18.R file on the class website.
- 29 March 36-402, Advanced Data Analysis Let’s sum up PCA. We start with n different p-dimensional vectors as our data, i.e., each observation as p numerical variables. We want to reduce the number of dimensions to something more manageable, say q. The principal components of the data are the q orthogonal directions of greatest variance in the original p-dimensional space; they can be found by taking the top q eigenvectors of the sample covariance matrix. Principal components analysis summarizes the data vectors by projecting them on to the principal components. All of this is purely an algebraic undertaking; it involves no probabilistic assumptions whatsoever. It also supports no statistical inferences — saying nothing about the population or stochastic process which made the data, it just summarizes the data. How can we add some probability, and so some statistics? And what does that let us do? Start with some notation. X is our data matrix, with n rows for the different observations and p columns for the different variables, so Xij is the value of variable j in observation i. Each principal component is a vector of length p, and there are p of them, so we can stack them together into a p × p matrix, say w. Finally, each data vector has a projection on to each principal component, which we collect into an n × p matrix F. Then
X = Fw (1) [n × p] = [n × p][p × p]
where I’ve checked the dimensions of the matrices underneath. This is an exact equation involving no noise, approximation or error, but it’s kind of useless; we’ve replaced p-dimensional vectors in X with p-dimensional vectors in F. If we keep only to q < p largest principal components, that corresponds to dropping columns from F and rows from w. Let’s say that the truncated matrices are Fq and wq. Then
X ≈ Fq wq (2) [n × p] = [n × q][q × p]
The error of approximation — the difference between the left- and right- hand- sides of Eq. 2 — will get smaller as we increase q. (The line below the equation is a sanity-check that the matrices are the right size, which they are. Also, at this point the subscript qs get too annoying, so I’ll drop them.) We can of course make the two sides match exactly by adding an error or residual term on the right:
X = Fw + (3)
where has to be an n × p matrix. Now, Eq. 3 should look more or less familiar to you from regression. On the left-hand side we have a measured outcome variable (X), and on the right-hand side we have a systematic prediction term (Fw) plus a residual (). Let’s run
There is another route from PCA to the factor model, which many people like but which I find less compelling; it starts by changing the objectives. PCA aims to minimize the mean-squared distance from the data to their projects, or what comes to the same thing, to preserve variance. But it doesn’t preserve correlations. That is, the correlations of the features of the image vectors are not the same as the correlations among the features of the original vectors (unless q = p, and we’re not really doing any data reduction). We might value those correlations, however, and want to preserve them, rather than the than trying to approximate the actual data.^1 That is, we might ask for a set of vectors whose image in the feature space will have the same correlation matrix as the original vectors, or as close to the same correlation matrix as possible while still reducing the number of dimensions. This leads to the factor model we’ve already reached, as we’ll see.
It’s common to represent factor models visually, as in Figure 1. This is an example of a graphical model, in which the nodes or vertices of the graph represent random variables, and the edges of the graph represent direct statis- tical dependencies between the variables. The figure shows the observables or features in square boxes, to indicate that they are manifest variables we can ac- tual measure; above them are the factors, drawn in round bubbles to show that we don’t get to see them. The fact that there are no direct linkages between the factors shows that they are independent of one another. From below we have the noise terms, one to an observable. Notice that not every observable is connected to every factor: this depicts the fact that some entries in w are zero. In the figure, for instance, X 1 has an arrow only from F 1 and not the other factors; this means that while w 11 = 0.87, w 21 = w 31 = 0. Drawn this way, one sees how the factor model is generative — how it gives us a recipe for producing new data. In this case, it’s: draw new, independent values for the factor scores F 1 , F 2 ,... Fq ; add these up with weights from w; and then add on the final noises 1 , 2 ,... p. If the model is right, this is a procedure for generating new, synthetic data with the same characteristics as the real data. In fact, it’s a story about how the real data came to be — that there really are some latent variables (the factor scores) which linearly cause the observables to have the values they do.
(^1) Why? Well, originally the answer was that the correlation coefficient had just been invented, and was about the only way people had of measuring relationships between variables. Since then it’s been propagated by statistics courses where it is the only way people are taught to measure relationships. The great statistician John Tukey once wrote “Does anyone know when the correlation coefficient is useful, as opposed to when it is used? If so, why not tell us?” (Tukey, 1954, p. 721).
-0.
0.73 0.
Figure 1: Graphical model form of a factor model. Circles stand for the unob- served variables (factors above, noises below), boxes for the observed features. Edges indicate non-zero coefficients — entries in the factor loading matrix w, or specific variances ψi. Arrows representing entries in w are decorated with those entries. Note that it is common to omit the noise variables in such diagrams, with the implicit understanding that every variable with an incoming arrow also has an incoming noise term.
One of the most important consequences of the factor model is that observable variables are correlated with each other solely because they are correlated with the hidden factors. To see how this works, take X 1 and X 2 from the diagram, and let’s calculate their covariance. (Since they both have variance 1, this is the same as their correlation.)
the data has a certain shape. To see what that is, pretend for right now that we can turn off the noise terms . The loading matrix w is a q × p matrix, so each row of w is a vector in p-dimensional space; call these vectors w~ 1 , ~w 2 ,... ~wq. Without the noise, our observable vectors would be linear combinations of these vectors (with the factor scores saying how much each vector contributes to the combination). Since the factors are orthogonal to each other, we know that they span a q-dimensional sub-space of the p-dimensional space — a line if q = 1, a plane if q = 2, in general a hyper-plane. If the factor model is true and we turn off noise, we would find all the data lying exactly on this hyper-plane. Of course, with noise we expect that the data vectors will be scattered around the hyper-plane; how close depends on the variance of the noise. But this is still a rather specific prediction about the shape of the data. A weaker prediction than “the data lie on a low-dimensional plane in the high-dimensional space” is “the data lie on some low-dimensional surface, pos- sibly curved, in the high-dimensional space”; there are techniques for trying to recover such surfaces, which can work even when factor analysis fails. But they are more complicated than factor analysis and outside the scope of this class. (Take 36-350.)
The roots of factor analysis go back to work by Charles Spearman just over a century ago (Spearman, 1904); he was trying to discover the hidden structure of human intelligence. His observation was that schoolchildren’s grades in different subjects were all correlated with each other. He went beyond this to observe a particular pattern of correlations, which he thought he could explain as follows: the reason grades in math, English, history, etc., are all correlated is performance in these subjects is all correlated with something else, a general or common factor, which he named “general intelligence”, for which the natural symbol was of course g or G. Put in a form like Eq. 4, Spearman’s model becomes
X = + Gw (12)
where G is an n × 1 matrix (i.e., a row vector) and w is a 1 × p matrix (i.e., a column vector). The correlation between feature i and G is just wi ≡ w 1 i, and, if i 6 = j, vij ≡ cov [Xi, Xj ] = wiwj (13)
where I have introduced vij as a short-hand for the covariance. Up to this point, this is all so much positing and assertion and hypothesis. What Spearman did next, though, was to observe that this hypothesis carried a very strong implication about the ratios of correlation coefficients. Pick any
four distinct features, i, j, k, l. Then, if the model (12) is true,
vij /vkj vil/vkl
wiwj /wkwj wiwl/wkwl
wi/wk wi/wk
The relationship vij vkl = vilvkj (17)
is called the “tetrad equation”, and we will meet it again later when we consider methods for causal discovery. In Spearman’s model, this is one tetrad equation for every set of four distinct variables. Spearman found that the tetrad equations held in his data on school grades (to a good approximation), and concluded that a single general factor of intel- ligence must exist. This was, of course, logically fallacious. Later work, using large batteries of different kinds of intelligence tests, showed that the tetrad equations do not hold in general, or more exactly that departures from them are too big to explain away as sampling noise. (Recall that the equations are about the true correlations between the variables, but we only get to see sample correlations, which are always a little off.) The response, done in an ad hoc way by Spearman and his followers, and then more system- atically by Thurstone, was to introduce multiple factors. This breaks the tetrad equation, but still accounts for the correlations among features by saying that features are really directly correlated with factors, and uncorrelated conditional on the factor scores.^3 Thurstone’s form of factor analysis is basically the one people still use — there have been refinements, of course, but it’s mostly still his method.
The factor model introduces a whole bunch of new variables to explain the observables: the factor scores F, the factor loadings or weights w, and the observable-specific variances ψi. The factor scores are specific to each individual, and individuals by assumption are independent, so we can’t expect them to really generalize. But the loadings w are, supposedly, characteristic of the population. So it would be nice if we could separate estimating the population parameters from estimating the attributes of individuals; here’s how. Since the variables are centered, we can write the covariance matrix in terms of the data frames:
v = E
n
(^3) You can (and should!) read the classic “The Vectors of Mind” paper (Thurstone, 1934) online.
Summarizing, we really have p(p − 1)/2 degrees of freedom in v, and pq − q(q −1)/2 degrees of freedom in w. If these two match, then there is (in general) a unique solution which will give us w. But in general they will not be equal; then what? Let us consider the two cases.
More unknowns (free parameters) than equations (constraints) This is fairly straightforward: there is no unique solution to Eq. 23; instead there are infinitely many solutions. It’s true that the loading matrix w does have to satisfy some constraints, that not just any w will work, so the data does give us some information, but there is a continuum of different parameter settings which are all match the covariance matrix perfectly. (Notice that we are working with the population parameters here, so this isn’t an issue of having only a limited sample.) There is just no way to use data to decide between these different pa- rameters, to identify which one is right, so we say the model is unidentifiable. Most software for factor analysis, include R’s factanal function, will check for this and just refuse to fit a model with too many factors relative to the number of observables.
More equations (constraints) than unknowns (free parameters) This is more interesting. In general, systems of equations like this are overdeter- mined, meaning that there is no way to satisfy all the constraints at once, and there isn’t even a single solution. It’s just not possible to write an arbitrary covariance matrix v among, say, seven variables in terms of, say, a one-factor model (as p(p−1)/2 = 7(7−1)/2 = 21 > 7(1)−1(1−1)/2 = 7 = pq−q(q−1)/2). But it is possible for special covariance matrices. In these situations, the factor model actually has testable implications for the data — it says that only certain covariance matrices are possible and not others. For example, we saw above that the one-fator model implies the tetrad equations must hold among the observ- able covariances; the constraints on v for multiple-factor models are similar in kind but more complicated algebraically. By testing these implications, we can check whether or not the our favorite factor model is right.^5 Now we don’t know the true, population covariance matrix v, but we can estimate it from data, getting an estimate v̂. The natural thing to do then is to equate this with the parameters and try to solve for the latter:
̂ v = ψ̂ + ŵ T^ ŵ (24)
The book-keeping for degrees of freedom here is the same as for Eq. 23. If q is too large relative to p, the model is unidentifiable; if it is too small, the matrix equation can only be solved if ̂v is of the right, restricted form, i.e., if the model is right. Of course even if the model is right, the sample covariances are the true covariances plus noise, so we shouldn’t expect to get an exact match, but
(^5) Actually, we need to be a little careful here. If we find that the tetrad equations don’t hold, we know a one-factor model must be wrong. We could only conclude that the one-factor model must be right if we found that the tetrad equations held, and that there were no other models which implied those equations; but, as we’ll see, there are.
we can try in various way to minimize the discrepancy between the two sides of the equation.
Remember that in Spearman’s model with a single general factor, the covari- ance between observables i and j in that model is the product of their factor weightings: vij = wiwj (25)
The exception is that vii = w^2 i + ψi, rather than w^2 i. However, if we look at u = v − ψ, that’s the same as v off the diagonal, and a little algebra shows that its diagonal entries are, in fact, just w^2 i. So if we look at any two rows of U, they’re proportional to each other:
uij =
wi wk ukj (26)
This means that, when Spearman’s model holds true, there is actually only one linearly-independent row in in u. Recall from linear algebra that the rank of a matrix is how many linearly independent rows it has.^6 Ordinarily, the matrix is of full rank, meaning all the rows are linearly independent. What we have just seen is that when Spearman’s model holds, the matrix u is not of full rank, but rather of rank
where c is the matrix whose columns are the eigenvectors of u, and d is the diagonal matrix whose entries are the eigenvalues. That is, if we use all p eigen- vectors, we can reproduce the covariance matrix exactly. Suppose we instead use cq, the p × q matrix whose columns are the eigenvectors going with the q largest eigenvalues, and likewise make dq the diagonal matrix of those eigenval- ues. Then cqdqcqT^ will be a symmetric positive p × p matrix. This is a matrix of rank q, and so can only equal u if the latter also has rank q. Otherwise, it’s an approximation which grows more accurate as we let q grow towards p, and, at any given q, it’s a better approximation to u than any other rank-q matrix. This, finally, is the precise sense in which factor analysis tries preserve correlations, as opposed to principal components trying to preserve variance.
(^6) We could also talk about the columns; it wouldn’t make any difference.
The “predicted” covariance matrix ˜v in the last line is exactly right on the diagonal (by construction), and should be closer off-diagonal than anything else we could do with the same number of factors. However, our guess as to u depended on our initial guess about ψ, which has in general changed, so we can try iterating this (i.e., re-calculating cq and dq), until we converge.
It has probably not escaped your notice that the estimation procedure above requires a starting guess as to ψ. This makes its consistency somewhat shaky. (If we continually put in ridiculous values for ψ, there’s no reason to expect that ŵ → w, even with immensely large samples.) On the other hand, we know from our elementary statistics courses that maximum likelihood estimates are generally consistent, unless we choose a spectacularly bad model. Can we use that here? We can, but at a cost. We have so far got away with just making assumptions about the means and covariances of the factor scores F. To get an actual likelihood, we need to assume something about their distribution as well. The usual assumption is that Fik ∼ N (0, 1), and that the factor scores are independent across factors k = 1,... q and individuals i = 1,... n. With this assumption, the features have a multivariate normal distribution X~i ∼ N (0, ψ + wT^ w). This means that the log-likelihood is
np 2
log 2π −
n 2
log |ψ + wT^ w| −
n 2
tr
(ψ + wT^ w)
− 1 ̂ v
where tr a is the trace of the matrix a, the sum of its diagonal elements. Notice that the likelihood only involves the data through the sample covariance matrix ̂ v — the actual factor scores F are not needed for the likelihood. One can either try direct numerical maximization, or use a two-stage pro- cedure. Starting, once again, with a guess as to ψ, one finds that the optimal choice of ψ^1 /^2 wT^ is given by the matrix whose columns are the q leading eigen- vectors of ψ^1 /^2 v̂ ψ^1 /^2. Starting from a guess as to w, the optimal choice of ψ is given by the diagonal entries of ̂v − wT^ w. So again one starts with a guess about the unique variances (e.g., the residuals of the regressions) and iterates to convergence.^7 The differences between the maximum likelihood estimates and the “prin- cipal factors” approach can be substantial. If the data appear to be normally distributed (as shown by the usual tests), then the additional efficiency of max- imum likelihood estimation is highly worthwhile. Also, as we’ll see next time, it is a lot easier to test the model assumptions is one uses the MLE.
(^7) The algebra is tedious. See section 3.2 in Bartholomew (1987) if you really want it. (Note that Bartholomew has a sign error in his equation 3.16.)
The probably the best method for estimating factor scores is the “regression” or “Thomson” method, which says
F̂ir =
j
Xij bij (35)
and seeks the weights bij which will minimize the mean squared error, E[(F̂ir − Fir )^2 ]. You can work out the bij as an exercise, assuming you know w.
Recall from linear algebra that a matrix o is orthogonal if its inverse is the same as its transpose, oT^ o = I. The classic examples are rotation matrices. For instance, to rotate a two-dimensional vector through an angle α, we multiply it by
rα =
cos α − sin α sin α cos α
The inverse to this matrix must be the one which rotates through the angle −α, r− α 1 = r−α, but trigonometry tells us that r−α = rTα. To see why this matters to us, go back to the matrix form of the factor model, and insert an orthogonal q × q matrix and its transpose:
X = + Fw (37) = + FooT^ w (38) = + Hy (39)
We’ve changed the factor scores to H ≡ Ho, and we’ve changed the factor loadings to y ≡ oT^ w, but nothing about the features has changed at all. We can do as many orthogonal transformations of the factors as we like, with no observable consequences whatsoever.^8 Statistically, the fact that different parameter settings give us the same ob- servational consequences means that the parameters of the factor model are unidentifiable. The rotation problem is, as it were, the revenant of having an ill-posed problem: we thought we’d slain it through heroic feats of linear algebra, but it’s still around and determined to have its revenge.^9
(^8) Notice that the log-likelihood only involves wT (^) w, which is equal to wT (^) ooT (^) w = yT (^) y, so even assuming Gaussian distributions doesn’t let us tell the difference between the original and the transformed variables. (^9) Remember that we obtained the loading matrix w as a solution to wT (^) w = u, that is to we got w as a kind of matrix square root of the reduced correlation matrix. For a real number u there are two square roots, i.e., two numbers w such that w × w = u, namely the usual w = √ u and w = − √ u, because (−1) × (−1) = 1. Similarly, whenever we find one solution to wT^ w = u, oT^ w is another solution, because ooT^ = I. So while the usual “square root” of u is w = dq^1 /^2 c, for any orthogonal matrix oT^ dq^1 /^2 c will always work just as well.
rather than just fitting a multivariate Gaussian means betting that either this bias is really zero, or that, with the amount of data on hand, the reduction in variance outweighs the bias. (I haven’t talked about estimated errors in the parameters of a factor model. The easiest way to obtain these is through either the jack-knife method — leave each observation out, re-estimate the model, and look at the distribution of re- estimates around the full-data estimate — or the bootstrap method — randomly re-sample the data, re-estimate, and again look at the distribution around the full-data estimate.)
How many factors should we use? All the tricks people use for the how-many- principal-components question can be tried here, too, with the obvious modifi- cations. However, some other answers can also be given, using the fact that the factor model does make predictions, unlike PCA.
2[(Θ)̂ −(θ̂)] ∼ χ^2 r−s
when the data came from the small model. (Here ` is the log likelihood function.) The general regularity conditions needed for this to hold apply to Gaussian factor models, so we can test whether one factor is enough, two, etc. (Said another way, adding another factor never reduces the likelihood, but the equation tells us how much to expect the log-likelihood to go up when the new factor really adds nothing and is just over-fitting the noise.) The reasons for the χ^2 distribution of the log likelihood ratio are a bit of a long story (involving the Fisher information matrix, etc.), but a rough version goes as follows. The big model has r − s more parameters than the
small one; we can always arrange to write things so that the small model is the big one, with those r − s extra parameters set to zero. The central limit theorem gives us Gaussian errors for each of the parameter estimates, but without bias, so they’re mean-zero Gaussians. Taylor-expand the log likelihood in the estimation errors: the first-order terms all cancel, leaving second order terms. (The 2 on the left-hand side began as the 1 /2 in front of the quadratic terms in the Taylor series.) The square of a standard Gaussian is a χ^2 random variable, so the log likelihood ratio is a sum of things proportional to χ^21 s. In fact, the second derivatives in the Taylor expansion exactly cancel the standard deviations of the Gaussian estimation errors, leaving us r − s summed, independent χ^21 terms, or one χ^2 r−s. Determining q by getting the smallest one without a significant result in a likelihood ratio test is fairly traditional, but statistically messy.^11 To raise a subject we’ll return to, if the true q > 1 and all goes well, we’ll be doing lots of hypothesis tests, and making sure this compound procedure works reliably is harder than controlling any one test. Perhaps more worrisomely, calculating the likelihood relies on distributional assumptions for the factor scores and the noises, which are hard to check for latent variables.
A natural impulse, when looking at something like Figure 1, is to reify the factors, and to treat the arrows causally: that is, to say that there really is some variable corresponding to each factor, and that changing the value of that variable will change the features. For instance, one might want to say that there is a real, physical variable corresponding to the factor F 1 , and that increasing this by one standard deviation will, on average, increase X 1 by 0.87 standard deviations, decrease X 2 by 0.75 standard deviations, and do nothing to the other features. Moreover, changing any of the other factors has no effect on X 1. Sometimes all this is even right. How can we tell when it’s right? (^11) Suppose q is really 1, but by chance that gets rejected. Whether q = 2 gets rejected in term is not independent of this!
The rotation problem does not rule out the idea that checking the fit of a factor model would let us discover how many hidden causal variables there are.
Suppose we have two distributions with probability densities f 0 (x) and f 1 (x). Then we can define a new distribution which is a mixture of them, with density fα(x) = (1 − α)f 0 (x) + αf 1 (x), 0 ≤ α ≤ 1. The same idea works if we combine more than two distributions, so long as the sum of the mixing weights sum to one (as do α and 1 − α). We will look more later at mixture models, which provide a very flexible and useful way of representing complicated probability distributions. They are also a probabilistic, predictive alternative to the kind of clustering techniques we’ve seen before this: each distribution in the mixture is basically a cluster, and the mixing weights are the probabilities of drawing a new sample from the different clusters.^12 I bring up mixture models here because there is a very remarkable result: any linear, Gaussian factor model with k factors is equivalent to some mixture model with k +1 clusters, in the sense that the two models have the same means and covariances (Bartholomew, 1987, pp. 36–38). Recall from above that the likelihood of a factor model depends on the data only through the correlation matrix. If the data really were generated by sampling from k + 1 clusters, then a model with k factors can match the covariance matrix very well, and so get a very high likelihood. This means it will, by the usual test, seem like a very good fit. Needless to say, however, the causal interpretations of the mixture model and the factor model are very different. The two may be distinguishable if the clusters are well-separated (by looking to see whether the data are unimodal or not), but that’s not exactly guaranteed. All of which suggests that factor analysis can’t really tell us whether we have k continuous hidden causal variables, or one discrete hidden variable taking k+ values.
We have been working with fewer factors than we have features. Suppose that’s not true. Suppose that each of our features is actually a linear combination of a lot of variables we don’t measure:
Xij = ηij +
∑^ q
k=
AikTkj = ηij + A~i · T~j (43)
where q p. Suppose further that the latent variables Aik are totally inde- pendent of one another, but they all have mean 0 and variance 1; and that the noises ηij are independent of each other and of the Aik, with variance φj ; and the Tkj are independent of everything. What then is the covariance between Xia
(^12) We will get into mixtures in considerable detail in the next lecture.
and Xib? Well, because E [Xia] = E [Xib] = 0, it will just be the expectation of the product of the features:
E [XiaXib] (44) = E
(ηia + A~i · T~a)(ηib + A~i · T~b)
= E [ηiaηib] + E
ηia A~i · T~b
ηib A~i · T~a
( A~i · T~a)( A~i · T~b)
[( (^) q ∑
k=
AikTka
) ( (^) q ∑
l=
AilTlb
k,l
AikAilTkaTlb
k,l
E [AikAil] TkaTlb (49)
k,l
E [AikAil] E [TkaTlb] (50)
∑^ q
k=
E [TkaTkb] (51)
where to get the last line I use the fact that E [AikAil] = 1 if k = l and = 0 otherwise. If the coefficients T are fixed, then the last expectation goes away and we merely have the same kind of sum we’ve seen before, in the factor model. Instead, however, let’s say that the coefficients T are themselves random (but independent of A and η). For each feature Xia, we fix a proportion za between 0 and 1. We then set Tka ∼ Bernoulli(za), with Tka |= Tlb unless k = l and a = b. Then E [TkaTkb] = E [Tka] E [Tkb] = zazb
and E [XiaXib] = qzazb
Of course, in the one-factor model,
E [XiaXib] = wawb
So this random-sampling model looks exactly like the one-factor model with factor loadings proportional to za. The tetrad equation, in particular, will hold. Now, it doesn’t make a lot of sense to imagine that every time we make an observation we change the coefficients T randomly. Instead, let’s suppose that they are first generated randomly, giving values Tkj , and then we generate feature values according to Eq. 43. The covariance between Xia and Xib will be
∑q k=1 TkaTkb. But this is a sum of IID random values, so by the law of large numbers as q gets large this will become very close to qzazb. Thus, for nearly all choices of the coefficients, the feature covariance matrix should come very