












Prepare for STA 199 Midterm 1 Version B with this comprehensive study resource. Includes practice questions, accurate answers, and detailed explanations covering statistical methods, probability concepts, data analysis, distributions, and introductory inference techniques. Designed for undergraduate students to strengthen understanding, improve problem-solving skills, and excel in exams. Perfect for revision, self-assessment, and mastering key concepts in introductory statistics and data science.













Section 1: Foundations of Data & Experimental Design (Questions 1-15)
1. A researcher wants to study the average screen time of all college students
in California. She randomly selects 1,500 students from 10 different California
universities and records their daily screen time. What is the population in this
study?
A) The 1,500 selected students
B) Daily screen time
C) All college students in California
D) The 10 universities
Rationale: The population is the entire group of interest, which is all college
students in California. The sample is the 1,500 selected students.
2. A study finds a strong positive correlation between ice cream sales and the
number of drownings. This is an example of:
A) A causal relationship
B) A confounding variable (temperature)
C) A well-designed experiment
D) A negative association
Rationale: This is a classic example of a confounding variable (lurking variable).
Hot weather causes both ice cream sales and swimming/drownings to increase,
creating a correlation without causation.
3. To determine if a new drug lowers blood pressure, researchers randomly
assign 200 patients to receive either the new drug or a placebo. Neither the
patients nor the doctors measuring the blood pressure know which treatment
was given. This is best described as a:
A) Observational study
B) Randomized, double-blind, controlled experiment
C) Block design with matching
D) Retrospective case-control study
Rationale: The key features are random assignment (experiment), both subjects
and evaluators blinded (double-blind), and a comparison group (controlled).
4. Which of the following variables is continuous?
A) Number of siblings
B) Time to run 100 meters (in seconds)
C) Zip code
D) Final exam grade (A, B, C, D, F)
Rationale: Continuous variables can take any value within a range (e.g., 12.
seconds). Number of siblings is discrete; zip code is nominal; letter grades are
ordinal.
5. A survey asks respondents to rate their satisfaction with a product on a
scale from 1 (Very Unsatisfied) to 5 (Very Satisfied). This is an example of
what type of data?
A) Nominal
B) Ordinal
C) Interval
D) Ratio
Rationale: Ordinal data have a meaningful order (1 < 2 < 3), but the difference
between values is not necessarily uniform or meaningful in a mathematical sense.
6. A university wants to ensure its student survey represents the proportion of
freshmen, sophomores, juniors, and seniors. They randomly select 100
students from each class year. This sampling method is:
A) Simple random sample
B) Cluster sample
C) Stratified random sample
D) Convenience sample
Rationale: The population is divided into strata (class years), and random samples
are taken from each stratum.
C) The researcher's bias in measuring outcomes
D) A type of sampling error
Rationale: The placebo effect is a real, measurable physiological or psychological
response to an inert substance or procedure.
12. A researcher is studying the effect of sleep deprivation on test scores. The
amount of sleep a participant gets is the:
A) Response variable
B) Confounding variable
C) Explanatory variable
D) Lurking variable
Rationale: The explanatory variable (independent variable) is the one that is
manipulated or used to explain changes in the response variable (dependent
variable).
13. Which of the following is a statistic?
A) The true average height of all women in the world (μ)
B) The average height of 50 women randomly selected from New York (x̄)
C) The population proportion (p)
D) A fixed, unknown value
Rationale: A statistic is a numerical summary calculated from a sample.
Population parameters (μ, p) are fixed but usually unknown.
14. If a sample is biased, it means:
A) The sample size is too small.
B) The sample was not selected randomly.
C) The sample does not accurately represent the target population.
D) The sample has a large standard deviation.
Rationale: Bias is systematic error in sampling that leads to a sample that is not
representative of the population from which it was drawn.
15. A study follows a group of 1,000 healthy adults over 20 years to see who
develops heart disease and how their lifestyle factors relate to it. This is a:
A) Cross-sectional study
B) Prospective cohort study
C) Retrospective case-control study
D) Randomized experiment
Rationale: A prospective study follows a group (cohort) forward in time to
observe outcomes.
Section 2: Descriptive Statistics & Data Visualization (Questions 16-35)
16. Which measure of center is most affected by outliers?
A) Median
B) Mode
C) Mean
D) Interquartile Range
Rationale: The mean uses all values, so extreme outliers can pull it in their
direction. The median is resistant to outliers.
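A small numerical illustration of this point, using made-up values (the data below are hypothetical and not part of the exam): replacing one value with an extreme outlier moves the mean sharply but leaves the median where it was.

```python
from statistics import mean, median

# Hypothetical data: five typical values, then the same data
# with the largest value replaced by an extreme outlier.
typical = [10, 12, 13, 14, 16]
with_outlier = typical[:-1] + [160]

mean_typical = mean(typical)
median_typical = median(typical)
mean_outlier = mean(with_outlier)
median_outlier = median(with_outlier)

print("typical:     ", mean_typical, median_typical)
print("with outlier:", mean_outlier, median_outlier)
```

The median stays at the middle value in both lists, while the mean is pulled toward the outlier.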
17. For a skewed right distribution, which relationship between the mean and
median is most likely?
A) Mean > Median
B) Mean < Median
C) Mean = Median
D) Cannot be determined
Rationale: In a right-skewed distribution, the tail pulls the mean to the right,
making it larger than the median.
18. The standard deviation is best described as:
A) The average of the data points
B) The middle value of the data
C) The typical distance of data points from the mean
D) The range of the middle 50% of the data
Rationale: The standard deviation is a measure of spread that quantifies how much
individual data points deviate from the mean on average.
19. The five-number summary includes:
A) Mean, Median, Mode, Range, Standard Deviation
B) Minimum, Q1, Median, Q3, Maximum
C) Mean, Standard Deviation, Variance, Range
D) Z-scores, Percentiles, Quartiles
D) Cannot be determined
Rationale: The standard deviation is the square root of the variance. √25 = 5.
25. A side-by-side box plot is best used to:
A) Show the distribution of one categorical variable
B) Compare the distribution of a quantitative variable across multiple groups
C) Show the relationship between two quantitative variables
D) Display the frequency of individual data points
Rationale: Side-by-side box plots allow for visual comparison of center, spread,
and shape across different categories.
26. A scatterplot is used to examine the relationship between:
A) Two categorical variables
B) Two quantitative variables
C) One quantitative and one categorical variable
D) A variable and itself over time
Rationale: Scatterplots are the standard graphical tool for visualizing association
between two continuous variables.
27. The correlation coefficient (r) measures:
A) The slope of the regression line
B) The strength and direction of a linear relationship
C) The percentage of variation explained
D) The causality between two variables
Rationale: The correlation coefficient, r, ranges from -1 to +1 and quantifies the
linear association's strength and direction.
28. A correlation of r = 0.92 indicates:
A) A weak, negative linear relationship
B) A strong, positive linear relationship
C) A weak, positive linear relationship
D) No linear relationship
Rationale: A correlation close to +1 indicates a strong positive linear relationship.
29. Which of the following is a resistant measure of spread?
A) Standard Deviation
B) Variance
C) Range
D) Interquartile Range (IQR)
Rationale: IQR is based on percentiles and is not influenced by extreme values,
unlike range, variance, and standard deviation.
30. In a perfectly symmetrical, bell-shaped distribution, approximately what
percentage of data falls within 2 standard deviations of the mean?
Rationale: According to the Empirical Rule for normal distributions, about 95% of
data lies within 2 standard deviations of the mean.
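The 95% figure can be checked against the standard normal distribution. This is a minimal sketch using `NormalDist` from Python's standard `statistics` module; the variable names are illustrative only.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability of falling within 2 standard deviations of the mean.
within_2_sd = z.cdf(2) - z.cdf(-2)

print(round(within_2_sd, 4))
```

The exact value is slightly above 95%, which is why the Empirical Rule is stated as an approximation.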
31. A stem-and-leaf plot has the advantage over a histogram of:
A) Being better for categorical data
B) Preserving the original data values
C) Always showing the exact shape
D) Being easier to create for large datasets
Rationale: Stem-and-leaf plots show the distribution while retaining each
individual data point.
32. What is the median of the following dataset: 4, 8, 12, 16, 20?
Rationale: The median is the middle value in an ordered list. The dataset has 5
values, so the 3rd value is 12.
33. The 90th percentile of a dataset means:
A) 90% of the data is above that value.
B) 90% of the data is below that value.
C) The value is 90% of the mean.
D) The value is 90 standard deviations from the mean.
Rationale: Independence means the occurrence of one event does not affect the
probability of the other; thus, the conditional probability equals the marginal
probability.
38. A fair six-sided die is rolled. What is the probability of rolling a number
greater than 4?
Rationale: The numbers greater than 4 are 5 and 6, so there are 2 favorable
outcomes out of 6 total: 2/6 = 1/3.
39. A card is drawn from a standard 52-card deck. What is the probability it is
a heart or a king?
Rationale: Using the addition rule: P(Heart or King) = P(Heart) + P(King) -
P(Heart and King) = 13/52 + 4/52 - 1/52 = 16/52 = 4/13.
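The addition-rule calculation above can be cross-checked by enumerating the deck. This is a sketch with exact fractions; the rank and suit labels are arbitrary choices made for the illustration.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

# Count cards that are a heart, a king, or both (the king of hearts counts once).
favorable = [card for card in deck if card[1] == "hearts" or card[0] == "K"]
p_enumerated = Fraction(len(favorable), len(deck))

# Addition rule: P(Heart) + P(King) - P(Heart and King).
p_formula = Fraction(13, 52) + Fraction(4, 52) - Fraction(1, 52)

print(p_enumerated, p_formula)
```

Subtracting P(Heart and King) is what prevents the king of hearts from being counted twice.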
40. The probability that it rains today is 0.3. The probability that it rains
tomorrow is 0.4. Assuming independence, what is the probability it rains on
both days?
Rationale: For independent events, P(A and B) = P(A) * P(B) = 0.3 * 0.4 = 0.12.
41. A bag contains 5 red marbles and 3 blue marbles. Two marbles are drawn
without replacement. What is the probability that both are red?
Rationale: P(1st red) = 5/8. P(2nd red | 1st red) = 4/7. By the multiplication
rule for dependent events, P(both red) = (5/8) * (4/7) = 20/56 = 5/14.
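The two-draw probability can be worked out with exact fractions. This sketch uses the counts from the question (5 red, 3 blue); the variable names are illustrative.

```python
from fractions import Fraction

red, blue = 5, 3
total = red + blue

# First draw: 5 red out of 8 marbles.
p_first_red = Fraction(red, total)
# Second draw, given the first was red: 4 red left out of 7 marbles.
p_second_red_given_first = Fraction(red - 1, total - 1)

# Multiplication rule for dependent events.
p_both_red = p_first_red * p_second_red_given_first

print(p_both_red)
```

Because the marble is not replaced, the second factor uses the reduced counts rather than 5/8 again.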
42. If P(A) = 0.6, P(B) = 0.5, and P(A and B) = 0.3, then events A and B are:
A) Mutually exclusive only
B) Independent only
C) Both mutually exclusive and independent
D) Neither mutually exclusive nor independent
Rationale: Check independence: P(A) * P(B) = 0.6 * 0.5 = 0.3 = P(A and B), so
they are independent. Since P(A and B) ≠ 0, they are not mutually exclusive.
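Both checks in this rationale are mechanical, so they can be written as two comparisons. A minimal sketch using the probabilities given in the question; `math.isclose` is used to avoid relying on exact floating-point equality.

```python
import math

p_a, p_b, p_a_and_b = 0.6, 0.5, 0.3

# Independent if P(A and B) equals P(A) * P(B).
independent = math.isclose(p_a_and_b, p_a * p_b)
# Mutually exclusive if the events can never occur together.
mutually_exclusive = math.isclose(p_a_and_b, 0.0, abs_tol=1e-12)

print(independent, mutually_exclusive)
```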
43. A test for a disease has a 95% sensitivity (true positive rate) and a 90%
specificity (true negative rate). If 2% of the population has the disease, what is
the probability that a person actually has the disease given they tested
positive? This requires:
A) Binomial distribution
B) Bayes' Theorem
C) Law of Large Numbers
D) Addition Rule
Rationale: Bayes' Theorem is used to update the probability of an event based on
new evidence (a positive test result).
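The question only asks which tool applies, but the calculation itself is short. This is a sketch of Bayes' Theorem with the numbers given in the question (sensitivity 0.95, specificity 0.90, prevalence 0.02); the variable names are illustrative.

```python
sensitivity = 0.95   # P(positive | disease)
specificity = 0.90   # P(negative | no disease)
prevalence = 0.02    # P(disease)

# Total probability of a positive test: true positives plus false positives.
p_true_positive = sensitivity * prevalence
p_false_positive = (1 - specificity) * (1 - prevalence)
p_positive = p_true_positive + p_false_positive

# Bayes' Theorem: P(disease | positive).
p_disease_given_positive = p_true_positive / p_positive

print(round(p_disease_given_positive, 4))
```

Even with a fairly accurate test, the low prevalence means most positive results come from the much larger disease-free group, so the posterior probability is far below the sensitivity.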
44. The complement of an event A, denoted A^c, has probability:
Rationale: The complement rule states that P(A) + P(A^c) = 1.
45. A probability distribution of a discrete random variable must satisfy:
A) Each probability is between 0 and 1, and they sum to 1.
B) Each probability is between -1 and 1.
C) The sum of probabilities is 0.
D) The probabilities must be all equal.
Rationale: This is a fundamental requirement for a valid probability mass
function.
Rationale: P(X=1) = C(5,1) * (0.2)^1 * (0.8)^4 = 5 * 0.2 * (0.8)^4.
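The binomial expression in this rationale can be evaluated directly. A minimal sketch, taking n = 5 trials and success probability p = 0.2 from the formula above; the helper name `binomial_pmf` is hypothetical.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_exactly_one = binomial_pmf(k=1, n=5, p=0.2)

print(round(p_exactly_one, 4))
```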
51. Which of the following is NOT a property of the normal distribution?
A) It is symmetric about its mean.
B) It is discrete.
C) The mean, median, and mode are equal.
D) It is bell-shaped.
Rationale: The normal distribution is continuous, not discrete.
52. The standard normal distribution has a mean of _____ and a standard
deviation of _____.
Rationale: Z ~ N(μ=0, σ=1).
53. If X ~ N(100, 15), what is the probability that X is greater than 130?
Rationale: z = (130-100)/15 = 2. P(X > 130) = P(Z > 2).
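The tail probability P(Z > 2) can be computed numerically. This sketch reads N(100, 15) as mean 100 and standard deviation 15, which is how the rationale's z-score uses it, and relies on `NormalDist` from the standard `statistics` module.

```python
from statistics import NormalDist

x_dist = NormalDist(mu=100, sigma=15)

# Standardize: z = (x - mu) / sigma.
z_score = (130 - x_dist.mean) / x_dist.stdev

# Upper-tail probability P(X > 130), equal to P(Z > 2).
p_greater = 1 - x_dist.cdf(130)

print(z_score, round(p_greater, 4))
```

The Empirical Rule gives roughly (1 - 0.95) / 2 = 0.025 for the same tail; the computed value is slightly smaller.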
54. The 68-95-99.7 rule applies to:
A) Any distribution
B) Normal distributions
C) Skewed distributions
D) Binomial distributions
Rationale: This empirical rule is specific to normal distributions.
55. A random variable that can take on any value within an interval is a:
A) Discrete random variable
B) Continuous random variable
C) Binomial random variable
D) Categorical variable
Rationale: Continuous random variables have an uncountable number of possible
values (e.g., time, weight).
Section 4: Sampling Distributions & Inference Foundations (Questions 56-75)
56. The sampling distribution of a statistic is:
A) The distribution of values in a single sample.
B) The distribution of the statistic from all possible samples of the same size.
C) The distribution of the population.
D) The distribution of the standard deviation.
Rationale: It's the theoretical distribution of a statistic (like the sample mean)
across repeated sampling.
57. According to the Central Limit Theorem (CLT), for a sufficiently large
sample size, the sampling distribution of the sample mean will be
approximately normal regardless of:
A) The sample size
B) The population mean
C) The shape of the population distribution
D) The population standard deviation
Rationale: The CLT's power is that normality of the sampling distribution holds
for non-normal populations if n is large.
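A short simulation can make the CLT claim concrete. This is a sketch under assumptions chosen for the illustration: the population is exponential with rate 1 (strongly right-skewed, mean 1, standard deviation 1), each sample has n = 50 observations, and the seed is arbitrary.

```python
import random
from statistics import mean, stdev

random.seed(199)  # arbitrary seed so the run is reproducible

N = 50       # size of each sample (assumed large enough for this illustration)
REPS = 2000  # number of repeated samples

# Draw REPS samples from a right-skewed population and record each sample mean.
sample_means = [
    mean([random.expovariate(1.0) for _ in range(N)])
    for _ in range(REPS)
]

center = mean(sample_means)   # should be near the population mean, 1
spread = stdev(sample_means)  # should be near sigma / sqrt(n) = 1 / sqrt(50)

print(round(center, 3), round(spread, 3), round(1 / N**0.5, 3))
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the underlying population is far from normal.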
58. The mean of the sampling distribution of the sample mean (μ_x̄) is equal
to:
A) σ/√n
B) μ (the population mean)
C) x̄ (the sample mean)
D) s (the sample standard deviation)
Rationale: The sample mean is an unbiased estimator of the population mean.
59. The standard deviation of the sampling distribution of the sample mean is
called the:
A) Standard deviation
C) p
D) np
Rationale: The sample proportion is an unbiased estimator of the population
proportion p.
64. Bias refers to:
A) The variability of an estimator
B) The difference between the expected value of an estimator and the
parameter
C) The standard error of an estimator
D) The sample size
Rationale: Bias = E(estimator) - parameter. An unbiased estimator has bias = 0.
65. Which of the following will reduce the margin of error in a confidence
interval?
A) Increasing the confidence level
B) Increasing the sample size
C) Decreasing the sample size
D) Increasing the population standard deviation
Rationale: Margin of error = z* (σ/√n). Increasing n decreases the margin of error.
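The effect of n on the margin of error is easy to see numerically. This sketch applies the formula above with assumed values that are not from the exam: z* = 1.96 (the usual 95% critical value) and σ = 10. The function name `margin_of_error` is hypothetical.

```python
from math import sqrt

def margin_of_error(z_star: float, sigma: float, n: int) -> float:
    """Margin of error for a mean with known sigma: z* * (sigma / sqrt(n))."""
    return z_star * sigma / sqrt(n)

# Assumed illustration values: z* = 1.96, sigma = 10.
moe_n25 = margin_of_error(1.96, 10, 25)
moe_n100 = margin_of_error(1.96, 10, 100)

print(moe_n25, moe_n100)
```

Because n sits under a square root, quadrupling the sample size halves the margin of error.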
66. A 95% confidence interval for a population mean is (45, 55). This means:
A) 95% of the population data falls between 45 and 55.
B) There is a 95% probability that the true mean is between 45 and 55.
C) In repeated sampling, 95% of such intervals will contain the true
population mean.
D) The sample mean is 50 with 95% certainty.
Rationale: This is the correct frequentist interpretation of a confidence interval.
67. If all else remains the same, increasing the confidence level from 90% to
95% will cause the confidence interval to become:
A) Wider
B) Narrower
C) The same width
D) Impossible to determine
Rationale: A higher confidence level requires a larger critical value (z*), which
increases the margin of error.
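The critical values behind this rationale can be computed from the standard normal quantile function. A minimal sketch using `NormalDist.inv_cdf` from the standard `statistics` module; the helper name `z_critical` is hypothetical.

```python
from statistics import NormalDist

def z_critical(confidence_level: float) -> float:
    """Two-sided critical value z* leaving (1 - level) / 2 in each tail."""
    tail = (1 - confidence_level) / 2
    return NormalDist().inv_cdf(1 - tail)

z_90 = z_critical(0.90)
z_95 = z_critical(0.95)

print(round(z_90, 3), round(z_95, 3))
```

With the standard error held fixed, the larger z* at the higher confidence level translates directly into a wider interval.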
68. The t-distribution is used for inference about a population mean when:
A) The sample size is large
B) The population standard deviation (σ) is unknown
C) The population is not normal
D) The sample proportion is being estimated
Rationale: The t-distribution accounts for the additional uncertainty when
estimating σ with the sample standard deviation s.
69. As the degrees of freedom increase, the t-distribution:
A) Approaches the standard normal distribution
B) Becomes more skewed
C) Becomes more spread out
D) Has a mean that increases
Rationale: The t-distribution has heavier tails than the normal, but as df → ∞, it
converges to N(0,1).
70. A Type I error in hypothesis testing is:
A) Failing to reject a false null hypothesis
B) Rejecting a true null hypothesis
C) Rejecting a false null hypothesis
D) Failing to reject a true null hypothesis
Rationale: Type I error = false positive. The probability of a Type I error is α
(significance level).
71. The p-value of a hypothesis test is:
A) The probability the null hypothesis is true.
B) The probability of obtaining results as extreme as or more extreme than the
observed results, assuming H0 is true.
C) The probability of making a Type II error.
D) The significance level.
Rationale: The p-value measures the strength of evidence against the null
hypothesis.
72. If the p-value is less than the significance level (α), we:
A) Fail to reject H0
B) Reject H0
B) There is evidence that more than 50% of voters favor the candidate.
C) The sample proportion was exactly 0.55.
D) 95% of voters fall between 52% and 58%.
Rationale: Since the entire interval is above 0.50, the data provide evidence that
the true population proportion is > 0.50.
77. A study reports a p-value of 0.03 for a two-tailed test at α = 0.05. The
correct conclusion is:
A) Fail to reject H0; there is not sufficient evidence.
B) Reject H0; there is sufficient evidence.
C) Accept H0; the null hypothesis is true.
D) The result is not statistically significant.
Rationale: Since p-value (0.03) < α (0.05), we reject the null hypothesis.
78. Which of the following is an example of a non-sampling error?
A) Using a small sample size
B) A poorly worded survey question that confuses respondents
C) Natural variability in the sample
D) The margin of error
Rationale: Non-sampling errors include measurement error, non-response bias,
and processing errors, not related to the randomness of sampling.
79. A researcher fails to reject the null hypothesis when the null hypothesis is
actually false. This is a:
A) Type I error
B) Type II error
C) Correct decision
D) Sampling error
Rationale: Type II error (β) is failing to reject a false null hypothesis.
80. Which of the following is the most important factor in determining the
reliability of a study's conclusions?
A) The cost of the study
B) The number of researchers involved
C) The design (e.g., randomization, sample representativeness) and
appropriate sample size
D) The journal in which it was published
Rationale: A study's validity and reliability are determined by its methodological
rigor, including design, sampling, and analysis, not superficial factors.