




























Chapter 3 Material Type: Notes; Professor: Davidian; Class: Applied Longitudinal Data Analysis; Subject: Statistics; University: North Carolina State University; Term: Unknown 1989;





























As we saw in Chapter 1, a natural way to think about repeated measurement data is as a series of random vectors, one vector corresponding to each unit. Because the way in which these vectors of measurements turn out is governed by probability, we need to discuss extensions of the usual univariate probability distributions for (scalar) random variables to multivariate probability distributions governing random vectors.
First, it is wise to review the important concepts of random variable and probability distribution and how we use these to model individual observations.
RANDOM VARIABLE: We may think of a random variable Y as a characteristic whose values may vary. The way it takes on values is described by a probability distribution.
CONVENTION, REPEATED: It is customary to use upper case letters, e.g. Y, to denote a generic random variable and lower case letters, e.g. y, to denote a particular value that the random variable may take on or that may be observed (data).
EXAMPLE: Suppose we are interested in the characteristic “body weight of rats” in the population of all possible rats of a certain age, gender, and type. We might let
Y = body weight of a (randomly chosen) rat
from this population. Y is a random variable.
We may conceptualize that body weights of rats are distributed in this population in the sense that some values are more common (i.e. more rats have them) than others. If we randomly select a rat from the population, then the chance it has a certain body weight will be governed by this distribution of weights in the population. Formally, values that Y may take on are distributed in the population according to an associated probability distribution that describes how likely the values are in the population.
In a moment, we will consider more carefully why rat weights we might see vary. First, we recall the following.
(POPULATION) MEAN AND VARIANCE: Recall that the mean and variance of a probability distribution summarize notions of “center” and “spread” or “variability” of all possible values. Consider a random variable Y with an associated probability distribution.
The population mean may be thought of as the average of all possible values that Y could take on across the entire distribution. Because some values occur more frequently (are more likely) than others, this average takes those frequencies into account. We write
E(Y) (3.1)
to denote this average, the population mean. The expectation operator E denotes that the “averaging” operation over all possible values of its argument is to be carried out. Formally, the average may be thought of as a “weighted” average, where each possible value is represented in accordance with the probability with which it occurs in the population. The symbol “μ” is often used for the population mean.
The population mean may be thought of as a way of describing the “center” of the distribution of all possible values. The population mean is also referred to as the expected value or expectation of Y.
Recall that if we have a random sample of observations on a random variable Y, say Y1, ..., Yn, then the sample mean is just the average of these:
$$\bar{Y} = n^{-1} \sum_{j=1}^{n} Y_j.$$
For example, if Y = rat weight, and we were to obtain a random sample of n = 50 rats and weigh each, then the sample mean $\bar{Y}$ represents the average weight we would obtain.
The population variance may be thought of as measuring the spread of all possible values that may be observed, based on the squared deviations of each value from the “center” of the distribution of all possible values. More formally, variance is based on averaging squared deviations across the population, which is represented using the expectation operator, and is given by
var(Y ) = E{(Y − μ)^2 }, μ = E(Y ). (3.2)
(3.2) shows the interpretation of variance as an average of squared deviations from the mean across the population, taking into account that some values are more likely (occur with higher probability) than others.
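As a small illustration of (3.1) and (3.2), the sketch below computes the mean and variance of a discrete distribution as probability-weighted averages. The weights and probabilities are invented for illustration and are not taken from these notes.

```python
# Hypothetical discrete population: possible rat weights (grams) and the
# probability with which each occurs. These numbers are made up.
values = [250.0, 275.0, 300.0, 325.0, 350.0]
probs = [0.10, 0.20, 0.40, 0.20, 0.10]

def expectation(vals, ps):
    """Weighted average: each value counts in proportion to its probability."""
    return sum(v * p for v, p in zip(vals, ps))

mu = expectation(values, probs)                  # E(Y), as in (3.1)
sq_dev = [(v - mu) ** 2 for v in values]         # squared deviations from the mean
variance = expectation(sq_dev, probs)            # E{(Y - mu)^2}, as in (3.2)

print(mu, variance)
```

The same `expectation` helper is used for both quantities, which mirrors the fact that variance is itself an expectation, taken over squared deviations.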
GENERAL FACTS: If b is a fixed scalar and Y is a random variable, then
E(bY) = bE(Y) = bμ and E(Y + b) = E(Y) + b = μ + b;
var(bY) = b^2 var(Y) and var(Y + b) = var(Y).
SOURCES OF VARIATION: We now consider why the values of a characteristic that we might observe vary. Consider again the rat weight example.
Y = μ + b, (3.3)
where b is a random variable with population mean E(b) = 0 and variance var(b) = $\sigma_b^2$, say. Here, Y is “decomposed” into its mean value (a systematic component) and a random deviation b that represents by how much a rat weight might deviate from the mean rat weight due to inherent biological factors. (3.3) is a simple statistical model that emphasizes that we believe rat weights we might see vary because of biological phenomena. Note that (3.3) implies that E(Y) = μ and var(Y) = $\sigma_b^2$.
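A simulation can make model (3.3) concrete. The sketch below assumes, purely for illustration, that the deviations b are normally distributed and uses invented values for μ and the standard deviation of b; the model itself requires only E(b) = 0 and a finite variance.

```python
import random

random.seed(0)        # fixed seed so the run is reproducible
mu = 300.0            # hypothetical population mean weight
sigma_b = 20.0        # hypothetical standard deviation of the deviations b

# Generate Y = mu + b, with b drawn from a normal distribution
# (normality is an assumption of this sketch, not of model (3.3)).
n = 100_000
y = [mu + random.gauss(0.0, sigma_b) for _ in range(n)]

sample_mean = sum(y) / n
sample_var = sum((v - sample_mean) ** 2 for v in y) / (n - 1)

print(sample_mean, sample_var)   # should be close to mu and sigma_b ** 2
```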
There are still further sources of variation that we could consider; we defer discussion to later in the course. For now, the important message is that, in considering statistical models, it is critical to be aware of different sources of variation that cause observations to vary. This is especially important with longitudinal data, as we will see.
The model thus says that, at each xj, there is a population of possible Yj values we might see, with mean $\beta_0 + \beta_1 x_j$ and variance σ^2. We can represent this pictorially by considering Figure 2.
Figure 2: Simple linear regression (horizontal axis x, vertical axis y).
“ERROR”: An unfortunate convention in the literature is that the $\epsilon_j$ are referred to as errors, which causes some people to believe that they represent solely deviation due to measurement error. We prefer the term deviation to emphasize that Yj values may deviate from $\beta_0 + \beta_1 x_j$ due to the combined effects of several sources (including, but not limited to, measurement error).
INDEPENDENCE: An important assumption for simple linear regression and, indeed, more general problems, is that the random variables Yj, or equivalently, the $\epsilon_j$, are independent.
(Statistical) independence is a formal statistical concept with an important practical interpretation. In particular, in our simple linear regression model, this says that the way in which Yj at xj takes on its values is completely unrelated to the way in which $Y_{j'}$ observed at another position $x_{j'}$ takes on its values. This is certainly a reasonable assumption in many situations.
The consequence of independence is that we may think of data on an observation-by-observation basis; because the behavior of each observation is unrelated to that of others, we may talk about each one in its own right, without reference to the others.
Although this way of thinking may be relevant for regression problems where the data were collected according to a scheme like that in the example above, as we will see, it may not be relevant for longitudinal data.
As we have already mentioned, when several observations are taken on the same unit, it will be convenient, and in fact, necessary, to talk about them together. We thus must extend our way of thinking about random variables and probability distributions.
RANDOM VECTOR: A random vector is a vector whose elements are random variables. Let
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}$$
be an (n × 1) random vector.
$$E(Y_j) = \mu_j, \qquad \mathrm{var}(Y_j) = E\{(Y_j - \mu_j)^2\} = \sigma_j^2.$$
We might furthermore have that Yj is normally distributed; i.e.
$$Y_j \sim N(\mu_j, \sigma_j^2).$$
Inspection of (3.5) shows
(Yj − μj )(Yk − μk) (3.6)
will tend to be positive for most of the pairs of values in the population. Thus, the average in (3.5) will likely be positive.
Thus, the quantity of covariance defined in (3.5) makes intuitive sense as a measure of how “associated” values of Yj are with values of Yk.
$$\mathrm{cov}(Y_j, Y_j) = E\{(Y_j - \mu_j)(Y_j - \mu_j)\} = \mathrm{var}(Y_j) = \sigma_j^2.$$
var(Yj + Yk) = var(Yj ) + var(Yk) + 2cov(Yj , Yk).
That is, the variance of the population consisting of all possible values of the sum Yj + Yk is the sum of the variances for each population, adjusted by how “associated” the two values are. Note that if Yj and Yk are independent, var(Yj + Yk) = var(Yj ) + var(Yk).
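The identity for the variance of a sum can be checked numerically. In the sketch below, averages over a small invented data set stand in for population expectations (divisor n rather than n − 1); the identity holds exactly for such averages, so the two sides agree up to rounding.

```python
def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Average product of deviations (divisor n, mimicking a population average)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical paired values of (Yj, Yk) that tend to be large or small together.
yj = [1.0, 2.0, 3.0, 4.0, 5.0]
yk = [2.0, 1.0, 4.0, 3.0, 6.0]

total = [a + b for a, b in zip(yj, yk)]
lhs = cov(total, total)                              # var(Yj + Yk)
rhs = cov(yj, yj) + cov(yk, yk) + 2 * cov(yj, yk)    # var + var + 2 cov

print(lhs, rhs)
```

Because `cov(x, x)` is just the variance of `x`, one helper covers both quantities, in line with cov(Yj, Yj) = var(Yj).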
We now see how all of this information is summarized.
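EXPECTATION OF A RANDOM VECTOR: For an entire n-dimensional random vector Y, we summarize the means for each element in a vector
$$\mu = \begin{pmatrix} E(Y_1) \\ E(Y_2) \\ \vdots \\ E(Y_n) \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{pmatrix}.$$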
We define the expected value or mean of Y as
E(Y ) = μ;
the expectation operation is applied to each element in the vector Y , yielding the vector μ of means.
RANDOM MATRIX: A random matrix is simply a matrix whose elements are random variables; we will see a specific example of importance to us in a moment. Formally, if Y is a (r × c) matrix with element Yjk, each a random variable, then each element has an expectation, E(Yjk) = μjk, say. Then the expected value or mean of Y is defined as the corresponding matrix of means; i.e.
$$E(Y) = \begin{pmatrix} E(Y_{11}) & E(Y_{12}) & \cdots & E(Y_{1c}) \\ \vdots & \vdots & \ddots & \vdots \\ E(Y_{r1}) & E(Y_{r2}) & \cdots & E(Y_{rc}) \end{pmatrix}.$$
COVARIANCE MATRIX: We now see how this concept is used to summarize information on covariance among the elements of a random vector. Note that
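$$(Y - \mu)(Y - \mu)' = \begin{pmatrix} (Y_1 - \mu_1)^2 & (Y_1 - \mu_1)(Y_2 - \mu_2) & \cdots & (Y_1 - \mu_1)(Y_n - \mu_n) \\ (Y_2 - \mu_2)(Y_1 - \mu_1) & (Y_2 - \mu_2)^2 & \cdots & (Y_2 - \mu_2)(Y_n - \mu_n) \\ \vdots & \vdots & \ddots & \vdots \\ (Y_n - \mu_n)(Y_1 - \mu_1) & (Y_n - \mu_n)(Y_2 - \mu_2) & \cdots & (Y_n - \mu_n)^2 \end{pmatrix},$$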
which is a random matrix.
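Taking the expectation of each element of this random matrix yields the variances (on the diagonal) and covariances (off the diagonal), collected in the matrix written Σ in what follows. The sketch below mimics this with a small invented data set, using averages over observed vectors in place of expectations; all names and numbers are hypothetical.

```python
# Hypothetical observations of a 3-dimensional vector Y (one row per unit).
data = [
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.5],
    [3.0, 4.0, 1.0],
    [4.0, 3.0, 2.0],
]
m = len(data)       # number of observed vectors
n = len(data[0])    # dimension of each vector

# Elementwise mean vector.
mu = [sum(row[j] for row in data) / m for j in range(n)]

def outer_dev(row):
    """(y - mu)(y - mu)' for one observed vector y: an n x n matrix."""
    d = [row[j] - mu[j] for j in range(n)]
    return [[d[j] * d[k] for k in range(n)] for j in range(n)]

# Average the outer products element by element (divisor m, mimicking E).
sigma = [[0.0] * n for _ in range(n)]
for row in data:
    o = outer_dev(row)
    for j in range(n):
        for k in range(n):
            sigma[j][k] += o[j][k] / m

for row in sigma:
    print(row)
```

The result is symmetric by construction, since element (j, k) and element (k, j) average the same products.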
CORRELATION: It is informative to separate the information on “spread” contained in variances $\sigma_j^2$ from that describing “association.” Thus, we define a particular measure of association that takes into account the fact that different elements of Y may vary differently on their own.
The population correlation coefficient between Yj and Yk is defined as
$$\rho_{jk} = \frac{\sigma_{jk}}{\sqrt{\sigma_j^2}\,\sqrt{\sigma_k^2}}.$$
Of course, $\sigma_j = \sqrt{\sigma_j^2}$ is the population standard deviation of Yj, on the same scale of measurement as Yj, and similarly for Yk.
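CORRELATION MATRIX: It is customary to summarize the information on correlations in a matrix. The correlation matrix Γ is defined as
$$\Gamma = \begin{pmatrix} 1 & \rho_{12} & \cdots & \rho_{1n} \\ \rho_{21} & 1 & \cdots & \rho_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n1} & \rho_{n2} & \cdots & 1 \end{pmatrix}.$$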
For now, we use the symbol Γ to denote the correlation matrix of a random vector.
ALTERNATIVE REPRESENTATION OF COVARIANCE MATRIX: Note that knowledge of the variances $\sigma_1^2, \ldots, \sigma_n^2$ and the correlation matrix Γ is equivalent to knowledge of Σ, and vice versa. It is often easier to think of associations among random variables on the unitless correlation scale than in terms of covariance; thus, it is often convenient to write the covariance matrix another way that presents the correlations explicitly.
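Define the “standard deviation” matrix
$$T^{1/2} = \begin{pmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n \end{pmatrix}.$$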
The “1/2” reminds us that this is a diagonal matrix with the square roots of the variances on the diagonal. Then it may be verified that (try it)
$$T^{1/2} \Gamma T^{1/2} = \Sigma. \qquad (3.7)$$
The representation (3.7) will prove convenient when we wish to discuss associations implied by models for longitudinal data in terms of correlations. Moreover, it is useful to appreciate (3.7), as it allows calculations involving Σ that we will see later to be implemented easily on a computer.
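The relationship (3.7) can be verified numerically. The sketch below starts from an invented covariance matrix, extracts the standard deviations and the correlation matrix, and then rebuilds the covariance matrix from them. The matrix entries and the helper `matmul` are hypothetical and written only for this illustration.

```python
import math

# Hypothetical covariance matrix Sigma for n = 3.
sigma = [
    [4.0, 2.0, 0.6],
    [2.0, 9.0, 1.5],
    [0.6, 1.5, 1.0],
]
n = len(sigma)

# Standard deviations sigma_j: square roots of the diagonal of Sigma.
sd = [math.sqrt(sigma[j][j]) for j in range(n)]

# Correlation matrix Gamma: rho_jk = sigma_jk / (sigma_j * sigma_k).
gamma = [[sigma[j][k] / (sd[j] * sd[k]) for k in range(n)] for j in range(n)]

def matmul(a, b):
    """Plain matrix product of two square lists-of-lists."""
    size = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(size)) for j in range(size)]
            for i in range(size)]

# "Standard deviation" matrix T^{1/2}: diagonal, with sigma_j on the diagonal.
t_half = [[sd[j] if j == k else 0.0 for k in range(n)] for j in range(n)]

rebuilt = matmul(matmul(t_half, gamma), t_half)   # T^{1/2} Gamma T^{1/2}

for row in rebuilt:
    print(row)
```

Because T^{1/2} is diagonal, the product simply rescales element (j, k) of Γ by the product of the two standard deviations, which is why the covariance matrix is recovered.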
GENERAL FACTS: As we will see later, we will often be interested in linear combinations of the elements of a random vector Y ; that is, functions of the form
$$c_1 Y_1 + \cdots + c_n Y_n,$$
which may be written succinctly as c′Y, where c is the column vector
$$c = \begin{pmatrix} c_1 \\ \vdots \\ c_n \end{pmatrix}.$$
It is possible, using the facts on multiplication of random variables by scalars (see above) and the definitions of μ and Σ, to show that
$$E(c'Y) = c'\mu, \qquad \mathrm{var}(c'Y) = c'\Sigma c.$$
(Try to verify these.)
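A short numerical sketch of these two facts follows. The mean vector, covariance matrix, and coefficient vectors are invented, and the helper name `lin_comb_moments` is hypothetical. The second call uses c = (1, 1, 0)′, for which c′Σc should reduce to var(Y1) + var(Y2) + 2cov(Y1, Y2).

```python
def lin_comb_moments(c, mu, sigma):
    """Return (c'mu, c'Sigma c) for a coefficient vector c."""
    n = len(c)
    mean = sum(c[j] * mu[j] for j in range(n))
    var = sum(c[j] * sigma[j][k] * c[k] for j in range(n) for k in range(n))
    return mean, var

# Hypothetical mean vector and covariance matrix for n = 3.
mu = [1.0, 2.0, 3.0]
sigma = [
    [4.0, 2.0, 0.6],
    [2.0, 9.0, 1.5],
    [0.6, 1.5, 1.0],
]

mean_a, var_a = lin_comb_moments([1.0, -1.0, 2.0], mu, sigma)

# Special case: the sum Y1 + Y2.
mean_b, var_b = lin_comb_moments([1.0, 1.0, 0.0], mu, sigma)
check_b = sigma[0][0] + sigma[1][1] + 2 * sigma[0][1]

print(mean_a, var_a)
print(mean_b, var_b, check_b)
```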
If we have a random vector Y with elements that are continuous random variables, then, it is natural to consider the normal distribution as a probability model for each element Yj. However, as we have discussed, we are likely to be concerned about associations among the elements of Y. Thus, it does not suffice to describe each of the elements Yj separately; rather, we seek a probability model that describes their joint behavior. As we have noted, such probability distributions are called multivariate for obvious reasons.
The multivariate normal distribution is the extension of the normal distribution of a single random variable to a random vector composed of elements that are each normally distributed. Through its form, it naturally takes into account correlation among the elements of Y ; moreover, it gives a basis for a way of thinking about an extension of “least squares” that is relevant when observations are not independent but rather are correlated.
NORMAL PROBABILITY DENSITY: Recall that, for a random variable Y, the normal distribution has probability density function
$$f(y) = \frac{1}{(2\pi)^{1/2}\sigma} \exp\{-(y - \mu)^2/(2\sigma^2)\}. \qquad (3.8)$$
This function has the shape shown in Figure 3. The shape will vary in terms of “center” and “spread” according to the values of the population mean μ and variance σ^2 (e.g. recall Figure 1).
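To see how (3.8) behaves, the sketch below evaluates the density for invented values of μ and σ^2, checks that it is symmetric about μ and highest there, and confirms with a crude Riemann sum that the area under the curve is close to 1. The function name `normal_pdf` and the grid settings are choices made for this illustration.

```python
import math

def normal_pdf(y, mu, sigma2):
    """Normal density (3.8) with mean mu and variance sigma2."""
    return math.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

mu, sigma2 = 5.0, 4.0                       # hypothetical mean and variance
peak = normal_pdf(mu, mu, sigma2)           # value at the center
left = normal_pdf(mu - 1.5, mu, sigma2)     # equal distances either side of mu
right = normal_pdf(mu + 1.5, mu, sigma2)

# Crude Riemann sum over mu +/- 10 standard deviations.
step = 0.001
sd = math.sqrt(sigma2)
grid_n = int(20 * sd / step)
area = sum(normal_pdf(mu - 10 * sd + i * step, mu, sigma2) for i in range(grid_n)) * step

print(peak, left, right, area)
```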
Figure 3: Normal density function with mean μ.
Several features are evident from the form of (3.8):
$$\frac{(y - \mu)^2}{\sigma^2} = (y - \mu)(\sigma^2)^{-1}(y - \mu). \qquad (3.9)$$
Note that this term depends on the squared deviation (y − μ)^2.
SIMPLE LINEAR REGRESSION: For now, to appreciate this form and its extension, consider the method of least squares for fitting a simple linear regression. (The same considerations apply to multiple linear regression, which will be discussed later in this chapter.) As before, at each fixed value x 1 ,... , xn, there is a corresponding random variable Yj , j = 1,... , n, which is assumed to arise from
$$Y_j = \beta_0 + \beta_1 x_j + \epsilon_j, \qquad \beta = (\beta_0, \beta_1)'.$$
The further assumption is that the Yj are each normally distributed with means $\mu_j = \beta_0 + \beta_1 x_j$ and variance σ^2.
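For reference, the usual least squares fit of this model chooses the intercept and slope to minimize the sum of squared deviations of the observed values from the line. The sketch below uses the standard closed-form solution on invented data; the variable names are hypothetical and the data are not from these notes.

```python
# Hypothetical data observed at fixed x values.
x = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
y = [3.1, 3.9, 5.2, 5.8, 7.1, 7.9]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form minimizers of sum_j (y_j - b0 - b1 * x_j)^2.
sxy = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y))
sxx = sum((xj - x_bar) ** 2 for xj in x)
beta1 = sxy / sxx
beta0 = y_bar - beta1 * x_bar

def sum_sq_dev(b0, b1):
    """Sum of squared deviations of the observations from the line b0 + b1 * x."""
    return sum((yj - b0 - b1 * xj) ** 2 for xj, yj in zip(x, y))

print(beta0, beta1, sum_sq_dev(beta0, beta1))
```

Moving either coefficient away from the fitted value, for example `sum_sq_dev(beta0 + 0.1, beta1)`, gives a larger sum of squared deviations.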
MULTIVARIATE NORMAL PROBABILITY DENSITY: The joint probability distribution that is the extension of (3.8) to an (n × 1) random vector Y, each of whose components is normally distributed (but possibly associated), is given by
$$f(y) = \frac{1}{(2\pi)^{n/2}} |\Sigma|^{-1/2} \exp\{-(y - \mu)'\Sigma^{-1}(y - \mu)/2\}. \qquad (3.11)$$
$$(y - \mu)'\Sigma^{-1}(y - \mu). \qquad (3.12)$$
Note that this is a quadratic form, so it is a scalar function of the elements of (y − μ) and $\Sigma^{-1}$. Specifically, if we refer to the elements of $\Sigma^{-1}$ as $\sigma^{jk}$, i.e.
$$\Sigma^{-1} = \begin{pmatrix} \sigma^{11} & \cdots & \sigma^{1n} \\ \vdots & \ddots & \vdots \\ \sigma^{n1} & \cdots & \sigma^{nn} \end{pmatrix},$$
then we may write
$$(y - \mu)'\Sigma^{-1}(y - \mu) = \sum_{j=1}^{n} \sum_{k=1}^{n} \sigma^{jk} (y_j - \mu_j)(y_k - \mu_k). \qquad (3.13)$$
Of course, the elements $\sigma^{jk}$ will be complicated functions of the elements $\sigma_j^2$, $\sigma_{jk}$ of Σ, i.e. the variances of the Yj and the covariances among them.
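The double sum (3.13) and the density (3.11) translate directly into code. The sketch below is an illustration only: it takes the inverse and determinant of Σ as given inputs and checks the result in the simplest setting, a diagonal Σ (no association among the elements), where the joint density should equal the product of the univariate densities (3.8). All function names and numbers are hypothetical.

```python
import math

def quad_form(y, mu, sigma_inv):
    """Double sum (3.13): sum over j, k of sigma^{jk} (y_j - mu_j)(y_k - mu_k)."""
    n = len(y)
    d = [y[j] - mu[j] for j in range(n)]
    return sum(sigma_inv[j][k] * d[j] * d[k] for j in range(n) for k in range(n))

def mvn_pdf(y, mu, sigma_inv, det_sigma):
    """Multivariate normal density (3.11), given Sigma^{-1} and |Sigma|."""
    n = len(y)
    norm_const = (2.0 * math.pi) ** (-n / 2.0) * det_sigma ** (-0.5)
    return norm_const * math.exp(-quad_form(y, mu, sigma_inv) / 2.0)

def normal_pdf(y, mu, sigma2):
    """Univariate normal density (3.8)."""
    return math.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

# Hypothetical means and variances; Sigma is diagonal, so its inverse is too.
mu = [1.0, 2.0, 3.0]
variances = [4.0, 9.0, 1.0]
sigma_inv = [[1.0 / variances[j] if j == k else 0.0 for k in range(3)] for j in range(3)]
det_sigma = variances[0] * variances[1] * variances[2]

y = [2.0, 0.5, 3.5]
q = quad_form(y, mu, sigma_inv)
joint = mvn_pdf(y, mu, sigma_inv, det_sigma)

product = 1.0
for j in range(3):
    product *= normal_pdf(y[j], mu[j], variances[j])

print(q, joint, product)
```

With a non-diagonal Σ, the off-diagonal elements of the inverse bring cross-product terms into the double sum, which is how association among the elements enters the density.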
BIVARIATE NORMAL DISTRIBUTION: To gain insight into this suspicion, and to get a better understanding of the multivariate distribution, it is instructive to consider the special case n = 2, the simplest example of a multivariate normal distribution (hence the name bivariate).
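Here,
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}.$$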
Using the inversion formula for a (2 × 2) matrix given in Chapter 2,
$$\Sigma^{-1} = \frac{1}{\sigma_1^2 \sigma_2^2 - \sigma_{12}^2} \begin{pmatrix} \sigma_2^2 & -\sigma_{12} \\ -\sigma_{12} & \sigma_1^2 \end{pmatrix}.$$
We also have that the correlation between Y1 and Y2 is given by
$$\rho_{12} = \frac{\sigma_{12}}{\sigma_1 \sigma_2}.$$
Using these results, it is an algebraic exercise to show that (try it!)
$$(y - \mu)'\Sigma^{-1}(y - \mu) = \frac{1}{1 - \rho_{12}^2} \left\{ \frac{(y_1 - \mu_1)^2}{\sigma_1^2} + \frac{(y_2 - \mu_2)^2}{\sigma_2^2} - 2\rho_{12} \frac{(y_1 - \mu_1)}{\sigma_1} \frac{(y_2 - \mu_2)}{\sigma_2} \right\}. \qquad (3.14)$$
Compare this expression to the general one (3.13).
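The algebra behind (3.14) can be checked numerically. The sketch below evaluates the quadratic form two ways for invented bivariate parameters: once through the (2 × 2) inverse, as in (3.13), and once through the correlation form (3.14). All values and function names are hypothetical.

```python
import math

# Hypothetical bivariate parameters.
mu1, mu2 = 1.0, 2.0
var1, var2 = 4.0, 9.0        # sigma_1^2 and sigma_2^2
cov12 = 3.0                  # sigma_12

sd1, sd2 = math.sqrt(var1), math.sqrt(var2)
rho = cov12 / (sd1 * sd2)    # rho_12 = sigma_12 / (sigma_1 * sigma_2)

# Inverse of the 2 x 2 covariance matrix via the usual formula.
det = var1 * var2 - cov12 ** 2
inv = [[var2 / det, -cov12 / det],
       [-cov12 / det, var1 / det]]

def quad_matrix(y1, y2):
    """Quadratic form computed as the double sum (3.13)."""
    d = [y1 - mu1, y2 - mu2]
    return sum(inv[j][k] * d[j] * d[k] for j in range(2) for k in range(2))

def quad_rho(y1, y2):
    """Quadratic form computed from the correlation form (3.14)."""
    z1 = (y1 - mu1) / sd1
    z2 = (y2 - mu2) / sd2
    return (z1 ** 2 + z2 ** 2 - 2.0 * rho * z1 * z2) / (1.0 - rho ** 2)

a = quad_matrix(3.0, 0.5)
b = quad_rho(3.0, 0.5)
print(a, b)
```

Setting `cov12 = 0.0` makes `rho` zero, and the quadratic form then reduces to the sum of the two squared deviations, each divided by its variance.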
Inspection of (3.14) shows that the quadratic form involves two components:
The sum $(y_1 - \mu_1)^2/\sigma_1^2 + (y_2 - \mu_2)^2/\sigma_2^2$. This sum alone is in the spirit of the sum of squared deviations in least squares, with the difference that each deviation is now weighted in accordance with its variance. This makes sense: because the variances of Y1 and Y2 differ, information on the population of Y1 values is of “different quality” than that on the population of Y2 values. If variance is “large,” the quality of information is poorer; thus, the larger the variance, the smaller the “weight,” so that information of “higher quality” receives more weight in the overall measure. Indeed, then, this is like a “distance measure,” where each contribution receives an appropriate weight.