
Math 243: Lecture File 12

N. Christopher Phillips

7 May 2009


Hazards of using statistical tests

We will discuss things to watch out for when using z confidence intervals and z hypothesis tests.

Most of these hazards and warnings apply to most or all of the procedures we will encounter (one and two sample t procedures and one and two proportion z procedures, for both confidence intervals and hypothesis tests). These other procedures also have their own hazards and warnings, but for now we discuss the ones that apply broadly.

(Note: I have included at the appropriate place here the so far unused material from lecture file 10.)


Summary

These apply to both confidence intervals and hypothesis tests:

- The data must have been properly collected: the simple random sample condition.
- The data must have been properly collected: proper design. More complicated experimental designs (such as stratified random sampling) require other procedures.
- The distribution is supposed to be normal. (This is special to the z procedures.)
- You must know σ.

Summary (continued)

This one applies only to confidence intervals:

The margin of error only covers errors due to randomness in sampling.

These apply only to hypothesis tests:

- The P-value only covers errors due to randomness in sampling.
- You need judgement to decide how small a P-value is convincing. (In different language: how do you choose α?)
- Beware of hard and fast cutoffs on P-values.
- Statistically significant doesn’t mean important.
- If you run many tests, some of them will improperly reject the null hypothesis merely by chance. (This is a very important point that people seem to have trouble with.)
- Formulate your hypotheses before you do your experiment.

The data must have been properly collected: simple random sample condition

The data must come from a simple random sample, or at least it must be reasonable to treat the sample as if it were a simple random sample.

In some cases we really have a simple random sample, and this condition is met.

In other cases, we don’t have a simple random sample (often, because there is no way to choose one), but we hope that our sample can be treated as one. Knowing when this is the case depends on specific knowledge of the subject area.


Examples

Example 16.1 of the book: Students from an introductory psychology class can probably be safely treated as a simple random sample of young people for the purposes of experiments on vision, but students from an introductory sociology class can certainly not be safely treated as a simple random sample of young people for the purposes of attitude surveys.

A collection of laboratory rats chosen in no particular way is often treated as a simple random sample of that strain of rat. What one can conclude about wild rats is more questionable.


More examples

Repeated measurements can often be safely treated as a simple random sample from all possible measurements. See the book’s comment on trained tasters (under “Cautions about the z procedures”). There is nothing to make them unrepresentative.

If the data were gathered from a convenience sample or a voluntary response sample, the z procedures are useless. No statistical procedure can compensate for such mistakes in data collection.

Exceptions

For example, the results of the excite.com online poll don’t tell you anything about the population of US adults, or even anything about the population of frequent visitors to the excite.com website. (This is a voluntary response sample. There is no reason to think that the ones that click on the poll are even representative of the people who frequently visit the site.)

Things may change if the population of interest is people who respond to online polls. (However, in this case, those who respond specifically to the excite.com online poll constitute a convenience sample.)

The data must have been properly collected: proper design

If your experiment doesn’t have a control group, the z procedures are useless (for the same reason as if you use a poor sample). The control group must be treated identically except for the specific treatment being considered. For unexpected examples of what this might mean, see the following margin items in the book:

- “Don’t touch the plants” on page 388: touching the plants in order to make measurements, by itself, increased leaf damage by insects on some species but decreased it on another.
- “Scratch my furry ears” on page 222: rabbits given friendly human attention had lower cholesterol.
- “Dropping out” on page 390: people dropped out of an experiment on whether a weight loss program or an exercise program lowered cholesterol levels, and only the data from those who stayed was used in the analysis.


A problem from Midterm 2 Winter 2006

Professor Slump is studying a drug which is supposed to lower cholesterol levels. In a preliminary experiment, 10 patients take this drug for a month. The differences between their cholesterol levels at the beginning and at the end of the treatment are as follows:

−20  −3  17  −15  2  −37  −12  −17  −8  −9

Is it appropriate and safe to use the one sample t procedure on these data to find a 95% confidence interval for the mean change in cholesterol level resulting from treatment with this drug? Why or why not?


No.

The mathematical conditions for the t procedure (for a data set of this size, as we will see later: single peaked, roughly symmetric, no outliers) are satisfied.

However, the answer is no.

The experiment had no control group.

Double blind

If your experiment isn’t double blind, the results are suspect. For an unexpected example of how information affects outcomes, see the margin item “Death from superstition” on page 100. (The number 4 is considered unlucky by Japanese and Chinese people because its pronunciation is similar to that of the word for “death”. Deaths from heart disease are significantly higher on the 4th of a month for Japanese Americans and Chinese Americans, but not for whites. Sociologists attribute this to stress from an “unlucky” day.)

Side effects spoil double blind?

The article “Brawny Brains: Creatine pills may aid memory and cognition” (Science News, 16 Aug. 2003, vol. 164 no. 7, page 101) reports on a test of the effects of creatine on certain kinds of mental performance in humans.

The experiment described is a randomized double blind experiment in which each person serves as his own control (the treatments are switched halfway through).

In the last paragraph, someone is reported as questioning the validity of the results. Creatine can have several side effects (bad breath, etc.), so, despite the design, some people getting creatine might have known they were not getting a placebo.


More complicated experimental designs require other procedures

The procedure we have learned does not apply to more complicated sampling methods (stratified random samples, block designs, etc.).

In this course, we will stick with procedures for simple designs. If you need more, take the next course.


The distribution is supposed to be normal

The z procedures depend on the sampling distribution being (approximately) normal. For large enough samples, this is true for almost any distribution of the original data.

For moderate size samples, more caution is needed. Both strong skewness and outliers cause problems, and make the z procedures unsafe. From the book: “If outliers cannot be removed, ask your statistical consultant... .” Accordingly, explore your data first.

Normality condition (continued)

In practice, in the absence of outliers the z procedures are reasonably safe for moderate sample sizes for any reasonably symmetric distribution.

Read the book discussion. Later sections give more explicit suggestions on sample sizes.
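As a rough illustration of why larger samples help, here is a small simulation sketch in Python (not from the lecture; numpy is assumed to be available). It draws samples from a strongly right-skewed exponential distribution and shows that the skewness of the sample mean shrinks as the sample size grows, which is the sense in which the sampling distribution becomes closer to normal.

```python
import numpy as np

rng = np.random.default_rng(0)

def skewness_of_sample_means(n, reps=20_000):
    """Simulate `reps` samples of size n from a right-skewed (exponential)
    distribution and return the skewness of the resulting sample means."""
    means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    centered = means - means.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5

for n in [2, 10, 50, 200]:
    print(f"n = {n:3d}:  skewness of the sample mean ≈ {skewness_of_sample_means(n):.2f}")
```

The exponential distribution here is just one convenient example of a skewed distribution; the point is only that the skewness of the sample mean decreases (roughly like 1/√n) as n grows.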

Knowing the standard deviation

We will remove this (usually unrealistic) condition in the next chapters, when we consider the t procedures.

There are occasional situations in which σ is known. For example, you use a scientific instrument with known characteristics to make repeated measurements on a specimen for which the quantity being measured is unknown.


Specific caution for confidence intervals

The margin of error covers only errors due to the randomness in the choice of a simple random sample.

It does not cover errors due to failure to obtain a simple random sample in the first place. For example, in opinion surveys, it does not cover errors due to:

- Nonresponse.
- Undercoverage (such as omitting people who don’t have telephones, driver’s licenses, or whatever, or omitting Alaska and Hawaii).
- Bias in the question (even if unintentional).
- Etc.

These are often more serious than the kind of error that is covered.


Specific cautions for hypothesis tests

The P-value only covers errors due to randomness in sampling.

This is similar to the issues for confidence intervals on the previous slide. For hypothesis tests, these issues damage the P-value instead of the margin of error.

How small a P-value is convincing?

The general rule is: the more implausible you (or your intended audience) consider the alternate hypothesis to be, the smaller P has to be to convince. This means that you should choose the significance level α to be smaller.

Restated: to convince, you must persuade someone that it is more likely that the effect is real than that you chose, by bad luck, a highly unrepresentative sample. The more implausible your conclusion, the smaller the P-value you need.

Recall how it works

We test the null hypothesis, for example μ = 64, against an alternative hypothesis, say μ > 64. We choose a simple random sample, and compute x̄. The P-value is the probability that, if the null hypothesis is true, a simple random sample gives a value of x̄ which is as extreme (here, as large) as the one we got.

H0: μ = 64.    Ha: μ > 64.

Example: I know σ = 2.7, and a simple random sample of size 9 gives x̄ = 66.7. We have

z = (x̄ − μ0)/(σ/√n) = (66.7 − 64)/(2.7/√9) = 3.

Looking up in Table A, we find that the probability of having z ≥ 3 is 1 − 0.9987 = 0.0013, so the P-value is 0.13%. This means that, if the null hypothesis is true, the probability of getting x̄ ≥ 66.7 is 0.13%.
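For readers who want to check such numbers with software, here is a minimal sketch of the same computation in Python (scipy is assumed to be available; it is not part of the course materials):

```python
from math import sqrt
from scipy.stats import norm

# Values from the example above.
sigma = 2.7    # known population standard deviation
n = 9          # sample size
xbar = 66.7    # observed sample mean
mu0 = 64.0     # value of μ under the null hypothesis

# One-sample z statistic: z = (x̄ − μ0)/(σ/√n).
z = (xbar - mu0) / (sigma / sqrt(n))   # ≈ 3

# One-sided P-value for Ha: μ > 64 (upper tail of the standard normal).
p_value = 1 - norm.cdf(z)              # about 0.0013

print(f"z = {z:.2f}, P-value = {p_value:.4f}")
```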


Be careful with the interpretation!

If the null hypothesis is true, the probability of getting x̄ ≥ 66.7 is 0.13%.

This does not mean that there is a 99.87% probability that the null hypothesis is wrong!

How convincing this evidence is depends, among other things, on how plausible the alternative hypothesis is. If it is (subjectively) unlikely, a smaller P-value is needed to persuade people to believe it.


The logic behind hypothesis testing: an example with coins

Suppose we have a large collection of coins, some of which are ordinary fair coins and some of which have tails on both sides.

I choose one of these coins, flip it 10 times, and report the result: it came up tails every time.

For a fair coin, the probability that 10 tosses come out all tails is 1/1024, or a bit less than 0.1%.

Since this outcome is unlikely for a fair coin, we interpret it as evidence that the coin has tails on both sides.

The coin example is “hard”: if even one toss is heads, it can’t have tails on both sides. For a “soft” example, which is in some ways more realistic, see the discussion of free throw percentage at the beginning of Chapter 14.

The coin example has the advantage of allowing some simple explicit probability calculations, which allow one to give explicit examples for “how small a P-value is convincing”. (I will only report the results of the calculations here.)

Suppose we choose a coin at random from a bag in which half the coins are fair and half have tails on both sides. We flip the coin 10 times, and get tails every time. This outcome has a P-value of 1/1024 ≈ 0.000977, or a bit less than 0.1%.

The actual probability that the coin has tails on both sides is 1024/1025 ≈ 0.9990. That is, if we do this experiment a very large number of times, then in about 99.90% of all the cases in which we get 10 tails, the coin in fact has tails on both sides.

The alternative hypothesis is quite plausible: there is a 50% chance that a randomly chosen coin has tails on both sides.

Since the alternative hypothesis is plausible, the P-value of about 0.001 is strong evidence that the null hypothesis is false.


If one in a thousand coins has tails on both sides

Now suppose we choose a coin at random from a bag containing 1024 fair coins and one coin with tails on both sides. Again we flip the coin 10 times, and get tails every time. This outcome has the same P-value as before, a bit less than 0.1%.

The actual probability that the coin has tails on both sides is now 1/2. That is, if we do this experiment a very large number of times, then in about half of all the cases in which we get 10 tails, the coin is actually fair!

The difference is that the alternative hypothesis is now rather implausible: there is a 1/1025, or less than 0.1%, chance that a randomly chosen coin has tails on both sides.

Since the alternative hypothesis is not very plausible, the P-value of about 0.001 is only weak evidence that the null hypothesis is false.


If one in a million coins has tails on both sides

Next, suppose we choose a coin at random from a bag containing 2^20 (a bit over a million) fair coins and one coin with tails on both sides. Again we flip the coin 10 times, and get tails every time. This outcome still has the same P-value as before, a bit less than 0.1%.

The actual probability that the coin has tails on both sides is now only 1/1025. That is, if we do this experiment a very large number of times, then in almost all (more than 999 out of 1000) cases in which we get 10 tails, the coin is nevertheless actually fair!

In this case, the alternative hypothesis is now very implausible: there is a 1/(1 + 2^20) chance, less than one in a million, that a randomly chosen coin has tails on both sides.

Since the alternative hypothesis is so implausible, even a P-value of less than 0.1% has little value as evidence.

If no coins have tails on both sides

Finally, suppose we choose a coin at random from a bag containing many coins, all of which are fair. Once again we flip the coin 10 times, and get tails every time. This outcome still has the same P-value, a bit less than 0.1%.

The actual probability that the coin has tails on both sides is now zero, no matter what the P-value is, because there were no such coins in the bag.
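The posterior probabilities quoted in these scenarios follow from Bayes’ rule. Here is a small Python sketch (not from the lecture; illustration only) that reproduces them for each of the priors above:

```python
def prob_two_tailed_given_10_tails(prior_two_tailed):
    """Bayes' rule: P(coin has tails on both sides | 10 tails observed),
    given the fraction of two-tailed coins in the bag."""
    p_data_if_two_tailed = 1.0    # a two-tailed coin always comes up tails
    p_data_if_fair = 0.5 ** 10    # 1/1024 for a fair coin
    numerator = prior_two_tailed * p_data_if_two_tailed
    denominator = numerator + (1 - prior_two_tailed) * p_data_if_fair
    return numerator / denominator

# The four bags discussed above: half two-tailed, 1 in 1025, 1 in (1 + 2^20), none.
for prior in [0.5, 1 / 1025, 1 / (1 + 2**20), 0.0]:
    posterior = prob_two_tailed_given_10_tails(prior)
    print(f"prior = {prior:.3g}:  P(two-tailed | 10 tails) = {posterior:.4f}")
```

The P-value is the same (about 0.001) in every case; only the prior plausibility of the alternative hypothesis, and hence the posterior probability, changes.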

Digression: what happens to the false positives

An aside: This example has an important practical application. Suppose you want to screen a large population for a rare disease, and in the long run the test gives a false positive (says the disease is present when it isn’t) 0.1% of the time that the person does not have the disease.

If 0.1% of the population has the disease, then about half the people who test positive for the disease in fact don’t have it.

To illustrate why, suppose the test never gives false negatives. (This is not realistic.) Suppose the population is 10,000,000. Then we expect 0.1% of 10,000,000, or 10,000, people to have the disease. They all test positive.

Of the remaining 9,990,000 people, about 0.1%, that is, 9,990 people, will also test positive, even though they do not have the disease. This is only 10 fewer than the number of people who have the disease.
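A short Python sketch of this arithmetic (illustration only, keeping the same simplifying assumption of no false negatives):

```python
def screening_counts(population, prevalence, false_positive_rate):
    """Expected counts in a screening program, assuming (as in the text)
    that the test never gives false negatives."""
    diseased = population * prevalence
    healthy = population - diseased
    true_positives = diseased                        # everyone with the disease tests positive
    false_positives = healthy * false_positive_rate
    fraction_false = false_positives / (true_positives + false_positives)
    return true_positives, false_positives, fraction_false

# The example above: prevalence 0.1%, false positive rate 0.1%.
tp, fp, frac = screening_counts(10_000_000, 0.001, 0.001)
print(round(tp), round(fp), round(frac, 3))   # 10000  9990  0.5 (about half the positives are false)
```

Rerunning with a prevalence of 1/100,000 gives about 100 true positives and roughly 10,000 false positives, which is the situation described next.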


Many false positives

If only one person in 100,000 has the disease, then about 99% of the people who test positive for the disease in fact don’t have it. For example, in a population of 10,000,000, there will be about 100 people who have the disease, and most of them will presumably test positive. But there will also be about 10,000 other people who test positive for the disease, even though they don’t have it.

False positives may be subjected to additional tests, perhaps much more expensive, invasive, or risky. But debates over screening programs often ignore what happens to the false positives.


Back to how small a P-value is convincing

Laetrile is a substance commonly claimed by quacks to cure cancer. Considerable scientific evidence shows that it is worthless.

If you want to persuade me that laetrile is useful in combating autism, I am going to assume that you are a variant of the medical frauds who make money off it as a fake cancer cure. I am not going to be convinced by a P-value of 0.001. (Even if I were considering being convinced, I would insist on having competent professionals examine your experiment.)

On the other hand, if you want to convince me that Oregon State students have a lower mean Math SAT score than University of Oregon students, I will probably accept a P-value of 0.05. Similarly if you want to convince me of the reverse. I don’t consider either outcome obviously implausible.

Back to how small a P-value is convincing (continued)

I will not try to give more examples. Knowing how plausible something is depends on knowledge of, and judgement related to, the subject being investigated.

(The things I know best are not amenable to statistical analysis, so I don’t have a good source of examples.)

Beware of hard and fast cutoffs on P-values

If based on judgement and area-specific knowledge you decide to test at significance α = 0.05, and you get a P-value of 0.0507, what should you do?

See the book for more. In particular, see Problem 16.5.

However, on our exam problems, if α = 0.05 and P = 0.0507, you fail to reject the null hypothesis.


Beware of outliers

Another reason to plot your data: outliers might change significance to insignificance or vice versa.

If this happens, your results are suspect.

Always understand the shape of your data!


Statistically significant doesn’t mean important

“Statistically significant” means that there is strong evidence that there is some effect, some difference, or whatever.

It does not mean that the effect or difference is large, especially if the sample is large.

It does not mean that the effect or difference is important.

Statistical significance (continued)

Use a confidence interval for the population mean to estimate how big the difference is. (The book says that hypothesis tests are overused in statistical analyses, and confidence intervals are underused.)

Plot your data to see what is happening. You need subject area knowledge to decide whether the difference is important or not.

Large sample sizes can make small differences statistically significant.

A three point increase in your expected math SAT score may well be statistically significant, but is of little practical importance in getting into college, and is certainly not worth hundreds of dollars spent on SAT preparation courses.
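To see numerically how sample size drives significance, here is a rough Python sketch. The standard deviation of about 100 points for Math SAT scores is an assumed value for illustration, not a figure from the book, and the setup (a one-sided z test with known σ) is just the procedure discussed above.

```python
from math import sqrt
from scipy.stats import norm

# Illustration only: a 3-point increase in mean Math SAT score.
# The standard deviation of about 100 points is an assumed value.
effect = 3.0
sigma = 100.0

for n in [100, 10_000, 100_000]:
    z = effect / (sigma / sqrt(n))
    p = 1 - norm.cdf(z)    # one-sided P-value
    print(f"n = {n:6d}:  z = {z:5.2f},  P = {p:.4f}")
```

The 3-point difference is exactly the same in every row; only the sample size changes, yet it goes from nowhere near significant to overwhelmingly significant.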

Read pages 393–394 of the book.

If you run many tests, some of them will improperly reject the null hypothesis merely by chance

Example: I have a large bag of coins. Perhaps some of them have tails on both sides, but you have no idea how many. I choose 2000 coins at random from this bag, flip each one 10 times, and I find that two or three of them come up tails on every flip. I claim (with P = 1/1024 < 0.001) that this is strong evidence that those coins have tails on both sides. Do you believe me?

You shouldn’t. In the long run, you expect on average nearly one out of a thousand fair coins to come up tails all 10 times.

If I select one of the coins that came up tails 10 times, flip it another 10 times, and it again comes up tails every time, then I have evidence that the coin has tails on both sides.

This time, I formulated the hypotheses before the experiment on this coin.
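A quick Python sketch (illustration only) of the arithmetic behind this warning: with 2000 fair coins each flipped 10 times, a couple of all-tails runs are expected purely by chance.

```python
# Probability that a single fair coin comes up tails on all 10 flips.
p_all_tails = 0.5 ** 10    # 1/1024

n_coins = 2000
expected_all_tails = n_coins * p_all_tails              # about 1.95 coins
p_at_least_one = 1 - (1 - p_all_tails) ** n_coins       # about 0.86

print(f"expected number of all-tails fair coins: {expected_all_tails:.2f}")
print(f"P(at least one fair coin comes up all tails): {p_at_least_one:.2f}")
```

So seeing two or three all-tails coins out of 2000 is roughly what pure chance predicts, which is why the first round of flips proves nothing; the second, pre-planned round is what carries the evidence.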


Multiple analyses: The New York Times gets it wrong

Similarly, if you run a large number of tests at the (common) significance level α = 0.05 (or 5%), you expect that in about one in 20 of the cases in which the null hypothesis is true, you will reject it at significance 0.05. This is what significance 0.05 means: there is a 5% probability that a simple random sample from a population for which the null hypothesis is true will give a result as extreme as the one observed.

See Example 16.4 in the book. Somebody ran 20 tests (for association of cell phone use with 20 kinds of brain cancers), and found one of them was significant at α = 0.05. The New York Times claimed to be puzzled that for this kind of cancer, “the risk appeared to decrease... with greater mobile phone use”.
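A small Python sketch (illustration only, and assuming the tests are independent) of how quickly the chance of at least one spurious rejection grows when every null hypothesis is true:

```python
alpha = 0.05

# Assuming independent tests with all null hypotheses true, the chance of
# at least one rejection at level alpha is 1 - (1 - alpha)^k for k tests.
for k in [1, 15, 20]:
    p_any_rejection = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests: P(at least one rejection by chance) = {p_any_rejection:.2f}")
```

With 20 tests, the chance of at least one “significant” result by luck alone is roughly 64%, which is why the single significant cancer association above is not convincing.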


As the book says, “Running one test and reaching the α = 0.05 level is reasonably good evidence that you have found something; running 20 tests and reaching that level only once is not.”

Even if you run, say, 15 or 16 tests and reach the α = 0.05 level once, this isn’t good evidence of anything.

If you really think there might be an association with that form of brain cancer, you must start with a completely independent set of data and test that one hypothesis.

Also see the margin item “Honest hypotheses?” on page 366 of the book. (This is also an example of why the hypotheses must be formulated ahead of time.)

Formulate your hypotheses before you do your experiment

Example: Prof. Greenbottle chooses a coin, flips it 10 times, and records the sequence of results, say TTHTTHHHTT. For a fair coin, the probability of getting exactly this sequence is 1/1024 = 1/2^10. He claims that the outcome is strong evidence that the coin is biased to produce this particular result, with P = 1/1024 < 0.001.

This claim is nonsense.

Generally, any claim formulated after the experiment is done is suspect.

(There is some overlap with the issue of multiple analyses.)

Read Chapter 16!

Read Chapter 16 carefully.
