


Material Type: Notes; Class: Introduction to Biostatistics; Subject: STATISTICS; University: University of Wisconsin - Madison; Term: Fall 2008;
Typology: Study notes



Random Variables X, Y: numerical (contrast with § 6.3.1).
How can the association between X and Y (if any exists) be characterized and measured?
How can it be mathematically modeled via an equation, i.e., Y = f(X)?
Recall: $\mu_X = \text{Mean}(X) = E[X]$, $\mu_Y = \text{Mean}(Y) = E[Y]$
Definition: Population Covariance of X, Y
$\sigma_{XY} = E\big[(X - \mu_X)(Y - \mu_Y)\big]$
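A step the notes leave implicit: expanding the product inside the expectation and using linearity of $E$ gives an equivalent shortcut form of the population covariance. The symbol $\sigma_{XY}$ is an assumption here, chosen by analogy with the sample notation $s_{xy}$.

```latex
\begin{aligned}
\sigma_{XY} &= E\big[(X - \mu_X)(Y - \mu_Y)\big] \\
            &= E[XY] - \mu_X E[Y] - \mu_Y E[X] + \mu_X \mu_Y \\
            &= E[XY] - \mu_X \mu_Y .
\end{aligned}
```

Setting $Y = X$ recovers the population variance, $\sigma_X^2 = E[(X - \mu_X)^2]$.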
SAMPLE, size n. Recall:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$
$s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
$s_y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$
Definition: Sample Covariance of X, Y
$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
Note: Whereas $s_x^2 \ge 0$ and $s_y^2 \ge 0$, $s_{xy}$ is unrestricted in sign.
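The sample formulas and the remark on sign can be checked with a minimal Python sketch. The function names and the small data sets below are made up for illustration and are not part of the notes; the divisor n − 1 follows the sample definitions used here.

```python
def sample_mean(values):
    """Arithmetic mean: (1/n) * sum of the values."""
    return sum(values) / len(values)


def sample_variance(values):
    """Sample variance with divisor n - 1."""
    m = sample_mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def sample_covariance(xs, ys):
    """Sample covariance with divisor n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    mx, my = sample_mean(xs), sample_mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


# Illustrative (made-up) data: y tends to rise with x in ys_up,
# and to fall with x in ys_down (the same values in reverse order).
xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 5, 4, 6]
ys_down = list(reversed(ys_up))

cov_up = sample_covariance(xs, ys_up)      # positive
cov_down = sample_covariance(xs, ys_down)  # negative
var_x = sample_variance(xs)                # never negative
```

Reversing the order of the y values leaves both variances unchanged but flips the sign of the covariance, which is what "unrestricted in sign" means in practice. The covariance of a variable with itself reduces to its variance.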
For the sake of simplicity, let us assume that the predictor variable X is nonrandom (i.e., deterministic) and that the response variable Y is random. (The subsequent techniques can, however, be extended to random X as well.)
Example: X = fat (grams), Y = cholesterol level (mg/dL)
Suppose the following sample of n = 5 data values is obtained and graphed in a scatterplot, along with some accompanying summary statistics:
$\bar{x} = 80$
$\bar{y} = 240$
$s_x^2 = 250$
$s_y^2 = 1750$
Sample Covariance
sx y = 1
As the name implies, the variance measures the extent to which a single variable varies (about its mean). Similarly, the covariance measures the extent to which two variables vary (about their individual means), with respect to each other.
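As a rough check of the summary statistics in the example, the sketch below recomputes them in Python. The x values are the five fat intakes of the example (60, 70, 80, 90, 100 grams). The individual y values are not shown in this excerpt, so the list `y` is a hypothetical set chosen only because it reproduces the stated mean of 240 and variance of 1750. The covariance it yields is therefore illustrative, not a figure taken from the notes.

```python
from statistics import mean, variance

# Fat intake values (grams) from the example.
x = [60, 70, 80, 90, 100]

# HYPOTHETICAL cholesterol levels (mg/dL): assumed for illustration only,
# picked to match the stated summary statistics (mean 240, variance 1750).
y = [210, 200, 220, 280, 290]

n = len(x)
x_bar = mean(x)       # 80
y_bar = mean(y)       # 240
s_x2 = variance(x)    # 250 (divisor n - 1)
s_y2 = variance(y)    # 1750 (divisor n - 1)

# Sample covariance: average product of paired deviations, divisor n - 1.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

print(x_bar, y_bar, s_x2, s_y2, s_xy)
```

With these assumed y values the covariance is positive, reflecting that above-average fat intake is paired mostly with above-average cholesterol.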
Before moving on to the next section, some important details are necessary in order to provide a more formal context for this type of problem. In our example, the response variable of interest is cholesterol level Y, which presumably has some overall probability distribution in the study population. The mean cholesterol level of this population can therefore be denoted $\mu_Y$ (or, recall, the expectation $E[Y]$) and estimated by the "grand mean" $\bar{y} = 240$. Note that no information about X is used.
Now we seek to characterize the relation (if any) between cholesterol level Y and fat intake X in this population, based on a random sample using n = 5 fat intake values (i.e., $x_1 = 60$, $x_2 = 70$, $x_3 = 80$, $x_4 = 90$, $x_5 = 100$). Each of these fixed $x_i$ values can be regarded as representing a different amount of fat grams consumed by a subpopulation of individuals, whose cholesterol levels Y, conditioned on that value of $X = x_i$, are assumed to be normally distributed. The conditional mean of each subpopulation is written $\mu_{Y \mid X = x_i}$.
[Figure: normal curves for the five subpopulations, centered at the conditional means $\mu_{Y \mid X = 60}$, $\mu_{Y \mid X = 70}$, $\mu_{Y \mid X = 80}$, $\mu_{Y \mid X = 90}$, $\mu_{Y \mid X = 100}$, each with the same standard deviation $\sigma$.]
We can consider n = 5 subpopulations, each of whose cholesterol levels Y are normally distributed, and whose means are conditioned on X = 60, 70, 80, 90, 100 fat grams, respectively.
(It is also assumed that the conditional variances are all equal (to $\sigma^2$), and that they are independent of one another.) If no relation between X and Y exists, we would expect to see no organized variation in Y as X changes, and all of these conditional means would either be uniformly "scattered" around, or exactly equal to, the unconditional mean $\mu_Y$; recall the preceding discussion. But if there is a true relation between X and Y, then it becomes important to characterize and model the resulting (nonzero) variation.
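The contrast between "no relation" and "a true relation" can be illustrated with a small simulation. This is a sketch under stated assumptions, not something taken from the notes: the common standard deviation `SIGMA`, the subgroup size, and the two conditional-mean functions (one flat at 240, one a hypothetical straight line through the point (80, 240)) are all invented for illustration.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

X_VALUES = [60, 70, 80, 90, 100]  # fixed fat intake values (grams)
SIGMA = 20.0                      # ASSUMED common standard deviation
N_PER_GROUP = 10_000              # ASSUMED size of each simulated subgroup


def conditional_mean_flat(x):
    """No relation: every subpopulation shares the unconditional mean."""
    return 240.0


def conditional_mean_linear(x):
    """HYPOTHETICAL relation: the mean rises linearly with fat intake."""
    return 240.0 + 2.4 * (x - 80)


def simulate(mean_fn):
    """Draw Y ~ Normal(mean_fn(x), SIGMA^2) for each fixed x and
    return the sample mean of Y within each subpopulation."""
    return {
        x: mean([random.gauss(mean_fn(x), SIGMA) for _ in range(N_PER_GROUP)])
        for x in X_VALUES
    }


flat_means = simulate(conditional_mean_flat)
linear_means = simulate(conditional_mean_linear)

for x in X_VALUES:
    print(x, round(flat_means[x], 1), round(linear_means[x], 1))
```

Under the flat model the five subgroup means all land near 240, so knowing X adds nothing beyond the grand mean. Under the linear model they drift systematically with X, which is the organized variation that a model should capture.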