The document covers topics related to probability and statistics, including the capture-recapture method of sampling, clinical studies, and descriptive statistics. It explains the logic of capture-recapture computations and the controlled clinical study methodology. The document also covers descriptive statistics, including frequency tables and relative frequency. It includes examples and questions to test understanding of the topics presented.
Typology: Exercises
Capture – Recapture is a common method used to estimate the size of a population by sampling. Biologists and ecologists use this method extensively to estimate wild animal populations.
The logic of capture-recapture computations : The computation is the proportion formed by two correctly stacked ratios.
IF we can assume that the recaptured sample is representative of the whole population, then
$$\frac{\text{tagged in the recapture sample}}{\text{total in the recapture sample}} = \frac{\text{tagged in total population}}{\text{total population}}$$
Notice that we KNOW the number tagged in the total population: that is the number we tagged in the first sample.
Example 13.4 (reworded; this is the way such questions will appear on the exam): A large pond is stocked with catfish. You capture 200 catfish, tag them, and release them. You wait long enough for the tagged fish to mix back in with the general population. Then you capture another sample. This sample has 250 catfish. Of the 250 catfish in this second sample, 35 have tags.
If the second sample is representative of the catfish population in the pond, estimate the number of catfish in the pond.
I suggest you memorize this rather than the formula on page 530 of your text
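As a check on the arithmetic, the catfish example can be worked through with a short Python sketch. The variable names are illustrative only, and the estimate rests on the same assumption as the text: the second sample is representative of the pond.

```python
# Capture-recapture estimate for the catfish example (Example 13.4).
tagged_in_population = 200   # fish tagged and released after the first capture
recapture_total = 250        # size of the second sample
recapture_tagged = 35        # tagged fish found in the second sample

# Proportion: recapture_tagged / recapture_total
#           = tagged_in_population / population
# Cross-multiply and solve for the unknown population.
estimated_population = tagged_in_population * recapture_total / recapture_tagged

print(round(estimated_population))  # roughly 1429 catfish
```

The result is an estimate, not a count, so rounding to a whole fish is reasonable.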
Clinical Studies do not collect data for the same purposes as surveys and censuses.
Instead, Clinical Studies attempt to determine whether a single variable can cause a certain effect.
New vaccines and drug treatments are put through clinical studies before being officially approved for public use.
Things that are “unhealthy” like cigarettes and caffeine are officially identified as “unhealthy” after clinical studies show that people who include significant amounts of them in their lifestyle have more health problems than people who do not include them.
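A controlled clinical study uses two groups: a treatment group, which receives the treatment being studied, and a control group, which does not.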
Confounding Variable: a characteristic (not the one being studied) in which the control and treatment groups differ. Then you can’t tell whether the effect was due to the characteristic being studied or due to this other characteristic or a combination of both.
Randomized Controlled Study : subjects are randomly assigned to either the treatment or control group
We can only deduce that the treatment CAUSES the effect if the treatment group experiences the effect and the control group does not experience the effect.
Placebo Effect : just the idea that one is getting treatment can produce positive results. People receiving a placebo (a harmless, inactive substance like a “sugar pill”) often report experiencing improvement.
Blind Study: The placebo effect cannot be eliminated, but it can be controlled by giving a placebo to the control group and conducting a blind study, in which neither the treatment group nor the control group knows whether it is getting the real treatment or the placebo.
Double Blind Study: The scientists conducting the study are also not aware of whether a participant is getting the real treatment or the placebo (participants and researchers are both “blind”).
Even clinical studies that are properly designed can lead to conflicting conclusions. But when clinical studies of the same variable, done in different labs by different groups, consistently find the same conclusion , the clinical study method is persuasive.
Experimental Variable: Flu Vaccine
Causes???
Effect: Prevents getting the flu
Association is NOT Causation : Just because 2 conditions occur together does not mean one condition causes the other. They may both be caused by some third condition, or they may just coincide by chance. ALSO a single effect can have many possible and actual causes.
Example: The school district that receives the most federal money and pays the highest teachers’ salaries has the lowest national test scores. Does higher teacher pay cause low test scores?
Example : A black cat crosses your path in the morning and by afternoon you have lost your job. Did the black cat crossing your path cause you to lose your job?
Study #1: In order to determine the effectiveness of a new drug for HIV treatment, the researchers conducted a study at the Park HIV Clinic in Philadelphia. The clinic first asked all 8,000 of their HIV patients who were between the ages of 20 and 40 years of age if they would be willing to participate.
Only 2000 volunteered to participate in the study. All 2000 of those volunteers were given a battery of medical assessments to determine the severity of symptoms they were experiencing and prognosis.
The researchers looked at the results of these medical assessments and found there were 150 of these volunteers who were in the beginning stages of HIV infection and were showing only minimal symptoms. These 150 patients became the participants in the study.
By random assignment, 75 were assigned to “Group A” and the other 75 were assigned to “Group B”. Group A received injections from “Drug A” vials while Group B received injections from “Drug B” vials.
One vial was the experimental drug and the other vial was a placebo treatment. Neither the patients nor the researchers knew whether “Drug A” or “Drug B” was the actual treatment drug.
Participants received the injections once a week for 6 months. At the end of the 6 months of treatment, the patients were again given the same battery of medical assessments to determine the severity of symptoms they were experiencing and prognosis. The average level of health was found to be significantly better for Group B.
Group B turned out to be the group that had received the real drug treatment.
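The random assignment step in Study #1 can be sketched in Python as follows. The participant IDs and the fixed seed are assumptions made for illustration; the study description does not say how the randomization was actually carried out.

```python
import random

# Hypothetical ID numbers 1..150 standing in for the 150 selected patients.
participants = list(range(1, 151))

# A fixed seed (an assumption) makes the split reproducible for this demo.
rng = random.Random(2024)
rng.shuffle(participants)

# After shuffling, the first 75 go to Group A and the rest to Group B.
group_a = participants[:75]
group_b = participants[75:]

print(len(group_a), len(group_b))
```

Because chance alone decides who lands in each group, characteristics such as age or severity of symptoms tend to balance out between the groups, which is what guards against confounding variables.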
Study #2: In order to determine the effectiveness of a new vaccine that is alleged to cure “math anxiety”, a clinical study was conducted. One thousand college students enrolled in math courses across the U.S. were chosen to participate in the study. The 1,000 students were broken up into two groups. Those enrolled in calculus courses or higher were given the real vaccine. The students in remedial and basic math courses were given a fake vaccine consisting of sugared water. None of the students knew whether they were being given the real or the fake vaccine, but the researcher conducting the experiment knew. At the end of the semester the students were given a test that measured their level of math anxiety. The students in the treatment group showed significantly lower levels of math anxiety than those in the control group. On the basis of this experiment the vaccine was advertised as being highly effective in fighting math anxiety.
Descriptive Statistics : statistics that summarize or otherwise describe large amounts of numerical data.
Data set : a collection of data values
Data points : the individual data values in the data set
Raw data: data as it was first gathered before any summarizing or computational manipulation
N is the size of the data set (the population of data)
Frequency : how often a particular data value occurs
Outliers : Extreme values in the data that do not fit the overall pattern of the data.
Make a frequency tally for this data set of exam scores. This is a first step in organizing data.
95, 90, 85, 90, 70, 15, 70, 50, 55, 80, 70, 80, 60, 45, 70, 75, 75, 75, 60, 65
For the frequency tally above, make a frequency table.
Important: In a Frequency Table, you only include the scores that actually happened.
Score   Frequency
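One way to check a hand-made tally is a short Python sketch that counts the exam scores listed above. `Counter` comes from the standard library; the printed layout is just one possible arrangement of the table.

```python
from collections import Counter

# Exam scores from the data set above.
scores = [95, 90, 85, 90, 70, 15, 70, 50, 55, 80,
          70, 80, 60, 45, 70, 75, 75, 75, 60, 65]

# Count how often each score occurs.
frequency = Counter(scores)

# Only scores that actually happened appear in the table.
print("Score  Frequency")
for score in sorted(frequency):
    print(f"{score:>5}  {frequency[score]:>9}")
```

Note that the score of 15 sits far from the rest of the data, which makes it a candidate outlier.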
Relative Frequency : the percent of the total population that had that value rather than the actual number that had that value.
Relative frequency is used most commonly when the actual frequencies are very large numbers. This makes them easier to compare.
Important: If you are graphing relative frequencies, be careful to:
Example: Find the relative frequencies for these raw data
What do we need to find first????
Score Frequency Relative Frequency
5 3,
4 2,
3 1,
2 1,
1 250
Example : In a high-stakes exam used for academic scholarship awards, N = 200,000 and the relative frequency of a perfect score is 0.04%. How many students made a perfect score on the exam?
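Going from a relative frequency back to an actual frequency is a single multiplication. A minimal Python sketch for the scholarship exam example, with illustrative variable names:

```python
# Scholarship exam example: N = 200,000 and 0.04% made a perfect score.
n = 200_000
relative_frequency_percent = 0.04

# A percent is "per hundred", so divide by 100 before scaling by N.
perfect_scores = round(n * relative_frequency_percent / 100)

print(perfect_scores)  # 80 students
```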
Bar graphs are often used to show frequencies. The higher the bar, the more frequent that data value.
Pictograph: uses pictures or icons to create the length of the bars
Characteristics of Bar Graphs:
Steps for creating a proper bar graph:
Step 1 : Organize the data values
Step 2: Make the vertical scale (usually) the frequencies. Use equal intervals and be sure to label the scale “frequency”
Step 3 : Make the horizontal scale the possible values. Make it an equal interval scale, including values that did not occur. Include a word-label that tells what those values represent.
Step 4 : Draw a bar above each value that did occur, making the bar as long as the frequency for that value. Keep the bars narrower than the space between values so that the bars do not touch one another.
Example:
Create a bar graph to display this data set as relative frequencies. The letters represent the letters of the correct answers on a multiple choice test.
A B A C B C C B C A A A C C B B B C B B C A B A B C B B B C B A A B A B C C A A A C C B B A A A B C
Step 1: Organize the data values
Steps 2- : Graph the values
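Step 1 for the multiple choice data can be sketched in Python: count each letter, then convert the counts to relative frequencies. The rough text "bars" at the end are only a stand-in for a drawn graph, and their scale (one mark per 2%) is an arbitrary choice.

```python
from collections import Counter

# Correct-answer letters from the example above.
answers = (
    "A B A C B C C B C A A A C C B B B C B B C A B A B "
    "C B B B C B A A B A B C C A A A C C B B A A A B C"
).split()

counts = Counter(answers)
total = len(answers)

# Relative frequency = frequency / N, expressed as a percent.
relative = {letter: 100 * counts[letter] / total for letter in sorted(counts)}

for letter, percent in relative.items():
    bar = "#" * round(percent / 2)   # one mark per 2%
    print(f"{letter}  {percent:5.1f}%  {bar}")
```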
Example from p. 553 of text: “Cheating” on the choice of starting value on vertical axis and stretching the scale on the vertical axis to make it look like there is more change than there actually is.
Example from P. 555 of text: When comparing characteristics of a population that is broken up into categories, it is essential to take into account the relative sizes of the various categories.
In the General Population:
Children (2-11 years) comprise 15% of the U.S. population
Teens (12-17) comprise 8% of the U. S. population
Adults comprise approximately 75% of the U. S. population
Circle Graphs (Pie Charts) are good for showing the respective sizes of categories within a whole population. The circle represents the whole and the size of the sectors (the “slices”) are proportional to the relative frequency of each category.
Remember 25% may be thought of as the ratio 25/100.
Also remember that a circle contains 360˚.
It is often helpful to work with the reduced fraction for the percent rather than the number of degrees.
So the basic proportion relationship when figuring out the size of each sector is:
$$\frac{\text{size of sector in degrees}}{360 \text{ degrees}} = \frac{\text{percent}}{100}$$
Make a Circle Graph to Represent the following data:
N = 100,000 marbles that are either red, blue, green, yellow, or purple.
Type Percent Degrees
Red 25%
Blue 50%
Green 10%
Yellow 10%
Purple 5%
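The Degrees column can be checked with a short Python sketch that applies the sector proportion to each color. The marble counts are included as an extra, using N = 100,000 from the example.

```python
# Percent of marbles of each color, from the table above.
percents = {"Red": 25, "Blue": 50, "Green": 10, "Yellow": 10, "Purple": 5}
n_marbles = 100_000

# sector degrees / 360 = percent / 100, so degrees = percent * 360 / 100.
degrees = {color: pct * 360 / 100 for color, pct in percents.items()}

# Actual number of marbles of each color.
counts = {color: n_marbles * pct // 100 for color, pct in percents.items()}

for color in percents:
    print(f"{color:<7} {percents[color]:>3}%  {degrees[color]:>6.1f} degrees  {counts[color]:>6} marbles")
```

The sector sizes should add up to the full 360 degrees, which is a quick way to catch an arithmetic slip.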
Variable: in statistics a variable is any characteristic that varies within the population. Examples: what color, what size, what kind, how many of...
Categorical Variable : (qualitative variable) represents a quality that is not normally measured numerically. For instance, gender.
Numerical Variable : (Quantitative variable) represents a measurable quality.
Practice: For each situation below, indicate whether the variable should be considered Categorical or Numerical. If Numerical, is it discrete or continuous?
(a) Hair color: blond, brown, black, red, grey
(b) Gender: male, female
(c) Shoe size: 4, 4 2
(d) Ethnicity: Black, Native American, Hispanic, ...
(e) Height in inches
(f) Gender when asked to record it as a “1” if female and a “2” if male
Histogram : a variation of a bar graph showing relative frequencies.
Remember relative frequencies are the percent of the total population that had that value. In really large data sets where the raw values are very close together, we frequently group the raw data values into equal-sized classes and count how many raw data values fall in each of these class intervals.
Important: When creating histograms, it is mathematically correct to draw bars for adjacent categories touching one another (different from regular bar graphs). This is because the class intervals are continuous, whereas the original categories are discrete (letter grades like A, B, C, D, F are discrete, but class intervals like 89-100 are continuous).
Example: GPA's for all students at MSUM would be an example of a large data set where the values are not well-separated. The values only run from 0 to 4 and are computed to 3 decimal places so you get values like 3.725, 3.724, 3.726, 3.725 all of which are very close together and are pretty much the same GPA.
Teachers frequently make histograms of grades grouped by A, B, C, D, and F to look at the type of grade distribution in their classes. This means that A+, A and A- are all recorded in the same bar. This is another example of data where the values run together and can be categorized together.
Class Practice: Make a histogram of the following quiz averages. Use the final grade chart below to group the quiz averages into letter grades.
72 85.5 93.5 68 73.5 82.5 80 79. 56.5 87.5 89.5 71 79.5 86 75 76. 83 86.5 78 67
Grading Scale:
A 91- B 81- C 71- D 61- F 60 -
Remember that the frequencies are not the scores, but _______________________________.
Another way to summarize data and make large data sets more comprehensible is to summarize them numerically. There are two main ingredients in such a summary:
Mean: The mean is the arithmetic average.
(see the next example)
Example: For the frequency chart below, you cannot get the average by adding just the numbers in the first row to get the total of the scores BECAUSE, for instance, the score of 9 did not happen just once, it happened 10 times.
If you need to compute the mean from a frequency table, it is easiest to add another row to the table that gives the “weighted” score. To get the weighted score, you multiply the exam score by the number of times it happened (the frequency). For instance, the score of 9 happened 10 times, so its weighted score is 9×10 = 90 because that is what you would get if you wrote out all 10 of them and added them up.
Then the mean of ALL OF THE SCORES = $\frac{\text{Total of all of the Weighted Scores}}{\text{Total of all of the Frequencies}}$
Now finish computing the mean for this Exam.
Point Scores on a 24-Point Exam
Exam Score:       1  6  7  8  9 10 11 12 13 14 15 16 24
Frequency:        1  1  2  6 10 16 13  9  8  5  2  1  1
Weighted Scores:
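The weighted-score row and the resulting mean for the 24-point exam can be checked with a short Python sketch. It follows the method described above: multiply each score by its frequency, total the products, and divide by the total of the frequencies.

```python
# Frequency table for the 24-point exam above.
exam_scores = [1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 24]
frequencies = [1, 1, 2, 6, 10, 16, 13, 9, 8, 5, 2, 1, 1]

# Weighted score = exam score * how many times it happened.
weighted_scores = [s * f for s, f in zip(exam_scores, frequencies)]

total_weighted = sum(weighted_scores)
total_frequency = sum(frequencies)   # this is N, the number of exams

mean = total_weighted / total_frequency
print(weighted_scores)
print(f"mean = {total_weighted} / {total_frequency} = {mean:.2f}")
```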
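Median : The median is the other measure of center that you are responsible for. It is the middle value when the data are listed in order; if there is an even number of values, average the two middle values.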
Class Practice:
(a) what is the mean of this data set?
(b) what is the median of this data set?
(a) what is the mean of this data set?
(b) what is the median of this data set?
(a) what is the mean of this data set?
(b) What is the median of this data set?
Score:      0  4  6  7  8  9 10
Frequency:  2  1  2  1  5  8  6
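For a frequency table like this last one, a sketch in Python can expand the table back into the full list of scores and then use the standard library `statistics` module to check a hand computation of the mean and median.

```python
import statistics

# Frequency table from the last practice problem.
scores = [0, 4, 6, 7, 8, 9, 10]
frequencies = [2, 1, 2, 1, 5, 8, 6]

# Write each score out as many times as it occurred.
all_scores = [s for s, f in zip(scores, frequencies) for _ in range(f)]

mean = statistics.mean(all_scores)
median = statistics.median(all_scores)

print(len(all_scores), mean, median)
```

With 25 values in the expanded list, the median is the 13th value in order, which is why it lands on an actual score rather than an average of two.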
Percentile: Percent and Percentile are NOT the same thing.
Compute a Percentile:
Step 1 : List the data in numerical order, from least to greatest.
Remember Percentile tells the relative position in the ordered list of data. So think of the data as being an ordered list of values like
$d_1, d_2, d_3, d_4, \ldots, d_N$ where each of these has a numerical value, but also a position value (the subscript).
Step 2 : Compute the locator for the pth percentile using the formula:
$$L = \left(\frac{p}{100}\right) N$$
Step 3 :
Try It Yourself: Find the 80th percentile value for the following GPAs:
3.4, 3.9, 3.3, 3.6, 3.5, 3.4, 4.0, 3.7, 3.3, 3.8, 3.6, 3.9, 3.7, 3.4, 3.
When L is a whole number: the pth percentile = $\frac{d_L + d_{L+1}}{2}$
When L is not a whole number: the pth percentile = $d_{L^+}$, where $L^+$ is L rounded up
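The locator rule can be sketched as a small Python function. The function name `percentile` and the restriction to whole-number p between 0 and 100 are assumptions for this sketch; the demonstration reuses the 20 exam scores from the frequency tally earlier in these notes.

```python
def percentile(data, p):
    """Return the p-th percentile using the locator rule L = (p/100) * N.

    Assumes p is a whole number with 0 < p < 100.
    """
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(data)          # list the data from least to greatest
    n = len(ordered)
    # Keep L as a quotient and remainder so that a whole-number
    # locator is detected exactly, with no decimal rounding issues.
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # L is whole: average d_L and d_(L+1). Python lists start at
        # index 0, so d_L is ordered[whole - 1].
        return (ordered[whole - 1] + ordered[whole]) / 2
    # L is not whole: round up and take the value in that position.
    return ordered[whole]


exam_scores = [95, 90, 85, 90, 70, 15, 70, 50, 55, 80,
               70, 80, 60, 45, 70, 75, 75, 75, 60, 65]

print(percentile(exam_scores, 80))  # L = 16, so average d_16 and d_17
print(percentile(exam_scores, 33))  # L = 6.6, so take d_7
```

Other textbooks and software packages use slightly different percentile rules, so results from a calculator or spreadsheet may not match this method exactly.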
The Quartiles divide the data set into four quarters. The data are first ordered.
Then find the middle of the data (which is the median) and that is Quartile 2 (abbreviated Q2). 50% of the data are below Q2 and 50% of the data are above Q2.
Q1 (the first quartile) is the point below which 25% of the data occur. (If there are an even number of scores, average the two middle scores)
Q3 (the third quartile) is the point below which 75% of the data occur.
Q4 (the fourth quartile mark) of course, would be the highest value in the list so that 100% of the data falls below that point.
Generally we only talk about Q1 and Q3. Instead of talking about Q2 we call it the Median and instead of Q4 we call it the Maximum.
Example 14.14 (p. 563 of text)
During the last year, 11 homes sold in the Green Hills subdivision. The selling prices, in chronological order, were:
$167,000 152, 128, 134, 192, 163, 121, 145, 170, 138, 155,
Find the Median and Quartiles for this situation.
Q2 is 30
Five Number Summary: consists of the Minimum, Q1, the Median, Q3, and the Maximum.
Example : Find the five-number summary for this data set:
7 4 10 8 5 6 4 6 1 3 7 5
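A Python sketch of the five-number summary for this data set follows. It assumes the quartiles are found as the 25th, 50th, and 75th percentiles with the locator rule from the percentile section; the helper name `locate` is made up for this example.

```python
def locate(ordered, p):
    """Value at the p-th percentile of an already sorted list (0 < p < 100)."""
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder == 0:
        # Whole-number locator: average that position and the next one.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Otherwise round the locator up and take that position.
    return ordered[whole]


data = [7, 4, 10, 8, 5, 6, 4, 6, 1, 3, 7, 5]
ordered = sorted(data)

summary = {
    "Min": ordered[0],
    "Q1": locate(ordered, 25),
    "Median": locate(ordered, 50),
    "Q3": locate(ordered, 75),
    "Max": ordered[-1],
}

for name, value in summary.items():
    print(f"{name}: {value}")
```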
Box Plot : A graphical representation of the 5-number summary. Box Plots are good for comparing two similar data sets, for instance two different samples from the same population. It gives a visual way to assess whether the two samples are significantly different or not.
Step 1 : Draw a scale that covers the entire range of values.
Step 2: Draw a box that has Q1 as the location of one end and Q3 as the location of the other end on this scale.
Step 3 : Draw a line through the box indicating the Median, M.
Step 4 : Locate the Minimum on the scale. Draw a “whisker” from the minimum to the nearest end of the box.
Step 5 : Locate the Maximum data value on the scale. Draw a “whisker” from the maximum to the box.
Example : Draw a box plot for the following 5-number summary:
Min = 1, Q1 = 9, M=11, Q3=12, Max =
Range : tells how spread out the data are. Take the difference (subtract) between the highest value and the lowest value of the data set. Notice that the range depends only on the most extreme values of the data: the highest (maximum) and the lowest (minimum) values.
Example: Find the range of this data set:
72 85.5 93.5 68 73.5 82.5 80 79.5 56.5 87.
Example: Draw the box plot for the data in the previous example:
Interquartile Range : the difference (subtraction) between the third quartile and the first quartile, Q3 −Q1.
Question: What percent of all the data points must lie within the interquartile range? Why?
Standard Deviation : the most important and most commonly used measure of spread for a data set. It represents the average amount that the data points differ (deviate) from the Mean.
Step 1 : Find the average (Mean) of the data set.
Step 2 : Make a table with three columns and as many rows as there are data values.
Step 3. In the first column, put the data values. It makes things easier if you order them from least to greatest.
Step 4 : In the second column put the answer to this computation (Data Value −Mean) for each row. Do all the subtractions in that order. Some of these values will be negative.
Step 5 : In the third column, square the value in the second column.
Step 6 : Total the third column. This is the sum of the squares of the deviations.
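Step 7 : The standard deviation, formally represented in statistics by the small Greek letter sigma ($\sigma$), is the square root of the total of column 3 divided by n, the number of data values:
$$\sigma = \sqrt{\frac{\text{total of column 3}}{n}}$$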
Example : Find the Standard Deviation for this data set of test scores: 1, 6, 7, 8, 8, 9, 10, 11, 12, 13, 14, 15, 16, 24
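The seven steps can be followed line by line in a Python sketch for this data set. It divides by n, as in the steps above; some calculators and software divide by n − 1 instead, which gives a slightly larger result.

```python
import math

# Test scores from the example, already in order (Step 3).
data = [1, 6, 7, 8, 8, 9, 10, 11, 12, 13, 14, 15, 16, 24]
n = len(data)

# Step 1: the mean.
mean = sum(data) / n

# Steps 2-5: one row per data value: value, deviation, squared deviation.
rows = [(x, x - mean, (x - mean) ** 2) for x in data]

# Step 6: total of the third column.
sum_of_squares = sum(squared for _, _, squared in rows)

# Step 7: square root of (total of column 3 divided by n).
standard_deviation = math.sqrt(sum_of_squares / n)

for value, deviation, squared in rows:
    print(f"{value:>5} {deviation:>7.1f} {squared:>8.1f}")
print(f"mean = {mean}, sum of squares = {sum_of_squares}")
print(f"standard deviation = {standard_deviation:.2f}")
```

The single score of 24 contributes 169 of the 388 in the third column, which shows how strongly one extreme value can pull on the standard deviation.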