BIOS 543 Second Review: Comparing Continuous Variables and Finding Relationships - Prof. A, Study notes of Data Analysis & Statistical Methods

A review for the second mid-term exam in a statistics course (BIOS 543), focusing on comparing the means of continuous variables and looking for relationships between continuous variables. It covers descriptive statistics, continuous and categorical responses, hypothesis testing for means, and correlation analysis.

Uploaded on 02/12/2009

Copyright © Al Best, 14 January, 2009. All rights reserved Rev 2/1
BIOS 543 Second Review
The second mid-term will cover the material on comparing the means of continuous variables and on
looking for relationships between continuous variables (Sections 5–10).
Continuous Responses
Which statistical method should I use? As we discussed in the first review (which covered how to
answer this question if the response is categorical) begin by considering the (random) response
variable and its measurement level (the Y and its modeling type).
Descriptive Statistics
No matter what hypotheses we test, we still have to summarize the data first. If the variable to be
summarized is continuous, first consider the shape of the distribution. Beyond visual inspection of
the histogram, the primary tool for deciding whether normality is warranted is the normal quantile
plot.
Don’t be too quick to reject normality; we have a strong preference for it. Most of the
statistics we use are robust and can handle moderate departures from normality. In large samples we
need a very strong argument to ignore this preference. In smaller samples, unless there is a strong
case against it, normality may be justifiable. Recall the central limit theorem (CLT).
Use the recommendations in Section 5, pages 17-18, to guide you. If normality is justifiable, then
report the n, mean, SD, and perhaps 95% CIs. If the data are clearly not normal, then report the n,
median, and either the IQR or range.
Note also that even if the main focus is on the continuous response variable, there may also be
categorical variables that you need to describe. For instance, if you are comparing two groups (e.g.
males and females), then describe the groups.
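As a minimal sketch of the reporting rule above, the following Python snippet uses only the standard library. The function name `summarize`, the `looks_normal` flag, and the sample values are illustrative assumptions, not part of the course materials. The judgement about normality still comes from the histogram and the normal quantile plot, and the t critical value has to be looked up for the sample's degrees of freedom (n - 1).

```python
import math
import statistics


def summarize(values, looks_normal, t_crit=None):
    """Summarize one continuous variable.

    looks_normal: the analyst's own judgement (histogram, normal quantile plot).
    t_crit: two-sided 95% t critical value for n - 1 degrees of freedom,
            supplied by the caller; if omitted, no CI is reported.
    """
    n = len(values)
    if looks_normal:
        mean = statistics.mean(values)
        sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
        summary = {"n": n, "mean": mean, "sd": sd}
        if t_crit is not None:
            half_width = t_crit * sd / math.sqrt(n)  # t * standard error
            summary["ci95"] = (mean - half_width, mean + half_width)
        return summary
    # Clearly non-normal data: report the median with the IQR and the range.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "n": n,
        "median": statistics.median(values),
        "iqr": (q1, q3),
        "range": (min(values), max(values)),
    }


# Hypothetical data; 2.365 is the 95% t critical value for 7 degrees of freedom.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(summarize(scores, looks_normal=True, t_crit=2.365))
print(summarize(scores, looks_normal=False))
```

Statistical packages differ slightly in how they interpolate quartiles, so the IQR from this sketch may not match other software exactly.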
Continuous Y
Categorical responses (that is, qualitative response variables with a nominal measurement level)
were covered earlier. Recall the first question:
Is the response categorical?
If the answer is, “No, it’s continuous,”¹ then the methods covered here are appropriate.
¹ Continuous variables are variables where taking an average makes some sense. They must,
therefore, be numeric. In addition, the rank ordering of values may also be considered here (using
nonparametric methods).

If the answer is “yes,” then go back to the first review. Next comes the most difficult question.

What’s the question?

Next we consider the substantive question. Since the response is continuous, we’re probably interested in means, but what about them? Questions refer either to comparisons or to relationships.

Looking to make a comparison?

If you are interested in comparing your observed mean to something, then you may be interested in differences between your data and an external reference, or you may be interested in differences between the groups within your study.

  • I have an observed mean; is it different from some hypothesized mean? If the answer to this question is “Yes,” then use the methods covered in the section 7 handout for “One Sample Mean,” where we compared an observed mean to a single hypothesized value.
  • I have two paired means; is there a difference between them? If two observations come from the same subject, the means are not independent. We must use the methods covered in section 7, “Two means from paired measurements.”
  • I have two independent groups*; is there a difference between the means? If the subjects in the groups are independent, then the means in the groups are independent. We will probably use the equal variance t-test (see section 8, “Two sample means”), although other considerations may change this (like unequal variance or clearly non-normal data). Note that we [almost] never know the population standard deviation, so we rarely use the z-statistic. We use one of the t-statistics or a nonparametric test, depending on our judgements as described in section 8, page 18.
  • I have two independent groups; is there a difference between the standard deviations (or variances)? Use the Brown-Forsythe test to compare variances (see section 8, pages 12-13).

* Note: If you are comparing the means of more than two groups, use the ANOVA methods covered in Section 12 (which we have NOT covered yet).

Looking for a relationship?

Alternatively, you may have two (random) responses and be interested in whether there is a relationship between the continuous values of these two variables. The step-by-step approach to answering these questions is given in section 10, starting on page 22.
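The comparison questions above each lead to a t statistic. The sketch below computes three of them with the Python standard library, assuming the usual textbook forms of the one-sample, paired, and equal-variance (pooled) two-sample t statistics; the course handouts may present them differently. The function names and the data are hypothetical. The standard library has no t distribution, so the sketch returns only the statistic and its degrees of freedom, to be compared against a t table or passed to a statistics package. It does not cover the unequal-variance t-test, the Brown-Forsythe test, or the nonparametric alternatives.

```python
import math
import statistics


def one_sample_t(values, mu0):
    """t statistic and df for comparing an observed mean to a hypothesized mean mu0."""
    n = len(values)
    se = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    return (statistics.mean(values) - mu0) / se, n - 1


def paired_t(first, second):
    """Paired comparison: a one-sample t-test on the within-subject differences."""
    diffs = [b - a for a, b in zip(first, second)]
    return one_sample_t(diffs, 0.0)


def pooled_t(group1, group2):
    """Equal-variance two-sample t statistic and df for independent groups."""
    n1, n2 = len(group1), len(group2)
    pooled_var = (
        (n1 - 1) * statistics.variance(group1)
        + (n2 - 1) * statistics.variance(group2)
    ) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.mean(group1) - statistics.mean(group2)) / se, n1 + n2 - 2


# Hypothetical measurements on the same four subjects, before and after.
before = [10, 12, 9, 11]
after = [12, 13, 11, 14]
print(paired_t(before, after))

# Hypothetical independent groups.
print(pooled_t([1, 2, 3], [4, 5, 6]))
```

Writing the paired test as a one-sample test on the differences makes the dependence explicit: the two means come from the same subjects, so only the differences are analyzed.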

How to use and interpret:

  • Grouped-values histogram
  • Mean and median
  • Range, variance, standard deviation
  • The empirical rule
  • Box plot
  • Percentiles, quartiles
  • Normal quantile plot
  • Standard error of the mean (how is this different than the SD?)
  • Confidence interval on the mean
  • Correlation, slope of the straight line fit
  • R-square, “variance accounted for”
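To make the last items concrete, here is a small sketch of how the correlation, the slope of the straight line fit, and R-square relate to one another, assuming an ordinary least-squares line of Y on X. The function name `straight_line_fit` and the data are hypothetical and chosen only so the arithmetic is easy to follow by hand.

```python
import math
import statistics


def straight_line_fit(x, y):
    """Least-squares line of y on x, with the correlation and R-square."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)  # sum of squares of x
    syy = sum((yi - my) ** 2 for yi in y)  # sum of squares of y
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # cross-products
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation
    return {"slope": slope, "intercept": intercept, "r": r, "r_square": r ** 2}


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(straight_line_fit(x, y))
```

In a straight line fit with a single X, R-square is the square of the correlation, which is why it can be read as the proportion of the variance in Y accounted for by X.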

Questions

  • What is the expected mean of the sample mean?
  • What is the relationship between standard deviation of the data and the standard error of the sample mean?
  • When is the sample mean normally distributed?
  • How does the central limit theorem help? What problem does it solve? When does it make statistics easier?
  • When would reporting a mean be not useful (not defendable)?
  • When would reporting a standard deviation be not useful (not defendable)?
  • When would a calculated CI on a mean be not useful (not defendable)?
  • Under what circumstances is it OK to change the value of an outlier?
  • Under what circumstances is it OK to remove unusual values from your dataset?
  • In regression, when do we use the Prediction Interval and when do we use the Confidence Interval?
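One way to explore the first few questions is by simulation. The sketch below is an illustration under stated assumptions rather than anything from the course materials: it draws repeated samples from a clearly skewed population (an exponential distribution with mean 1 and SD 1, chosen arbitrarily), records each sample mean, and then compares the spread of those means with the population SD divided by the square root of n. The function name and all parameter values are hypothetical.

```python
import math
import random
import statistics


def simulate_sample_means(n, repeats, seed=0):
    """Draw `repeats` samples of size n from an exponential population
    (mean 1, SD 1) and return the list of sample means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(repeats):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        means.append(statistics.mean(sample))
    return means


n = 25
means = simulate_sample_means(n=n, repeats=4000)
print("mean of the sample means:", statistics.mean(means))
print("SD of the sample means:  ", statistics.stdev(means))
print("population SD / sqrt(n): ", 1.0 / math.sqrt(n))
```

A histogram or normal quantile plot of `means` can then be compared with one of the raw exponential values to see how the shape changes as n grows.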