











































































This is a practice exam for probability and statistics in data science using Python. It includes multiple-choice questions covering continuous variables, sampling methods, measures of central tendency, variance, interquartile range, z-scores, skewness, the empirical rule, outlier detection, contingency tables, histograms, probability rules, Bayes' theorem, the Naive Bayes classifier, expected value, random variables, and various distributions (Bernoulli, binomial, Poisson, normal, exponential). Each question is accompanied by an explanation of the correct answer, which makes the exam useful for students and professionals preparing for certification or reinforcing their understanding of statistical concepts and their implementation in Python. The exam covers both theoretical knowledge and practical application using libraries such as pandas and scipy.stats.
Typology: Exams












































































Question 1. Which of the following best describes a quantitative continuous variable?
A) Number of students in a class
B) Temperature measured to two decimal places
C) Gender of respondents
D) Rating on a 5‑point Likert scale
Answer: B
Explanation: Continuous variables can take any value within an interval; temperature measured precisely is continuous.

Question 2. In sampling, stratified sampling is primarily used to:
A) Reduce cost by clustering similar units
B) Ensure each subgroup is represented proportionally
C) Select units completely at random
D) Sample only the largest clusters
Answer: B
Explanation: Stratified sampling divides the population into strata and samples from each proportionally, improving representation.

Question 3. The mean is most sensitive to:
A) Skewed distributions
B) Outliers
C) Sample size
D) Number of categories
Answer: B
Explanation: Extreme values (outliers) pull the arithmetic mean toward them more than median or mode.
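The claim in Question 3 can be checked numerically. The following is a minimal sketch using only the standard library; the values are hypothetical and chosen purely for illustration.

```python
import statistics

# Hypothetical values (illustrative only); 500 plays the role of an outlier.
values = [42, 45, 47, 50, 52]
with_outlier = values + [500]

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)

# The mean shifts by a large amount, the median only slightly.
print(f"mean:   {mean_before} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")
```

A single extreme value moves the mean far more than the median, which is why the median is often preferred for skewed data.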
Question 4. Which pandas function returns the mode(s) of a Series?
A) df.mean()
B) df.median()
C) df.mode()
D) df.var()
Answer: C
Explanation: mode() returns the most frequent value(s) in the Series.

Question 5. The variance of a sample differs from the population variance because:
A) It uses N‑1 in the denominator
B) It uses N in the denominator
C) It squares the mean instead of the deviations
D) It does not square the deviations
Answer: A
Explanation: Sample variance uses Bessel’s correction (N‑1) to produce an unbiased estimator.

Question 6. The Interquartile Range (IQR) is calculated as:
A) Q3 – Q1
B) Q1 – Q
C) Median – Q
D) Q3 – Median
Answer: A
Explanation: IQR measures the spread of the middle 50 % of data: third quartile minus first quartile.
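The quantities in Questions 4 to 6 can be computed side by side. This sketch uses Python's standard `statistics` module rather than pandas, so that it runs without extra dependencies; the data are hypothetical, and the quartile values depend on the interpolation method (here the module's default).

```python
import statistics

# Hypothetical sample; values chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

modes = statistics.multimode(data)            # most frequent value(s)
sample_var = statistics.variance(data)        # divides by N - 1 (Bessel's correction)
population_var = statistics.pvariance(data)   # divides by N

# quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                                 # spread of the middle 50 % of the data

print("mode(s):", modes)
print("sample variance:", sample_var)
print("population variance:", population_var)
print("IQR:", iqr)
```

The sample variance is always the larger of the two variances for the same data, because it divides the same sum of squared deviations by N - 1 instead of N.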
Question 10. Which plot is most useful for detecting outliers in a univariate numeric variable?
A) Histogram
B) Box plot
C) Scatter plot
D) Bar chart
Answer: B
Explanation: Box plots display whiskers and points beyond them, highlighting outliers.

Question 11. A contingency table is used to:
A) Summarize continuous data distribution
B) Show joint frequencies of two categorical variables
C) Plot time series data
D) Compute correlation coefficients
Answer: B
Explanation: It cross‑tabulates the counts of each combination of categories from two variables.

Question 12. In matplotlib, which function creates a histogram?
A) plt.plot()
B) plt.bar()
C) plt.hist()
D) plt.scatter()
Answer: C
Explanation: hist() bins data and draws the histogram.

Question 13. The probability of the union of two mutually exclusive events A and B is:
Answer: B
Explanation: Mutually exclusive events cannot occur together, so P(A∪B)=P(A)+P(B).

Question 14. If two events are independent, then P(A ∩ B) equals:
A) P(A) + P(B)
B) P(A) − P(B)
C) P(A) × P(B)
D) P(A) / P(B)
Answer: C
Explanation: Independence implies multiplication rule for intersection.

Question 15. Bayes’ theorem updates the prior probability to obtain the posterior probability using:
A) Likelihood only
B) Prior only
C) Likelihood and evidence
D) Evidence only
Answer: C
Explanation: Posterior ∝ Prior × Likelihood; evidence (normalizing constant) ensures probabilities sum to 1.
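To make the Bayes' theorem update in Question 15 concrete, here is a minimal sketch. The function name `posterior` and the screening-test numbers (1 % prior, 95 % true-positive rate, 5 % false-positive rate) are assumptions made up for illustration; they do not come from the exam.

```python
def posterior(prior: float, likelihood: float, false_positive_rate: float) -> float:
    """Return P(H | E) for a binary hypothesis H and observed evidence E.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    """
    # Evidence P(E) is the normalizing constant (law of total probability).
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers, for illustration only.
p_h_given_e = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"posterior = {p_h_given_e:.4f}")
```

With a small prior, the posterior stays well below the likelihood even for an accurate test, because the evidence term is dominated by false positives.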
Question 19. The variance of a Bernoulli(p) random variable is:
A) p
B) p(1‑p)
C) p²
D) (1‑p)²
Answer: B
Explanation: Var = p(1‑p) for a binary outcome.

Question 20. Which distribution models the number of successes in 10 independent trials with success probability 0.2?
A) Poisson(λ=2)
B) Binomial(n=10, p=0.2)
C) Geometric(p=0.2)
D) Negative Binomial(r=10, p=0.2)
Answer: B
Explanation: Fixed number of Bernoulli trials → Binomial.

Question 21. The probability mass function (PMF) of a Poisson(λ) distribution at k=3 is:
A) (e^{‑λ} λ³)/3!
B) λ³/3!
C) e^{‑λ} λ³
D) (λ³)/e^{‑λ}
Answer: A
Explanation: Poisson PMF: P(k)=e^{‑λ} λ^{k}/k!
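The formulas in Questions 19 to 21 translate directly into code. This is a sketch written from the textbook definitions with the standard library; the helper names `binomial_pmf` and `poisson_pmf` are hypothetical, and in practice `scipy.stats.binom` and `scipy.stats.poisson` provide equivalent functionality.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p): C(n, k) p^k (1 - p)^(n - k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam): e^(-lam) lam^k / k!."""
    return math.exp(-lam) * lam**k / math.factorial(k)


p = 0.2
bernoulli_var = p * (1 - p)   # variance of a single Bernoulli(p) trial

# Number of successes in 10 independent trials with p = 0.2 (Question 20).
binomial_probs = [binomial_pmf(k, 10, p) for k in range(11)]

# Poisson PMF at k = 3 (Question 21), using lam = 2 as an example rate.
poisson_at_3 = poisson_pmf(3, 2.0)

print("Bernoulli variance:", bernoulli_var)
print("Binomial PMF sums to:", sum(binomial_probs))
print("Poisson(2) PMF at k=3:", poisson_at_3)
```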
Question 22. Which scipy.stats function generates random variates from a normal distribution with mean 0 and std 1?
A) norm.rvs(loc=0, scale=1)
B) norm.pdf(loc=0, scale=1)
C) norm.cdf(loc=0, scale=1)
D) norm.ppf(loc=0, scale=1)
Answer: A
Explanation: rvs produces random samples; pdf, cdf, ppf compute densities, probabilities, quantiles.

Question 23. The uniform distribution on interval [a, b] has variance:
A) (b‑a)² / 12
B) (b‑a) /
C) (b‑a)² /
D) (b‑a) /
Answer: A
Explanation: Variance of continuous uniform = (b‑a)²/12.

Question 24. A standard normal variable Z has mean:
A) 0
B) 1
C) μ
D) σ
Answer: A
Explanation: By definition, standard normal N(0,1) has mean 0.
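The four operations named in Question 22 (random variates, density, cumulative probability, quantile) also exist in the standard library's `statistics.NormalDist`, which is used below so the sketch runs without SciPy. The seeds, sample sizes, and the interval [2, 8] are arbitrary choices for illustration. The second half checks the uniform variance formula from Question 23 by simulation.

```python
import random
import statistics
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

draws = std_normal.samples(5, seed=42)   # random variates (role of norm.rvs)
density_at_0 = std_normal.pdf(0.0)       # density (role of norm.pdf)
prob_below_0 = std_normal.cdf(0.0)       # cumulative probability (role of norm.cdf)
z_975 = std_normal.inv_cdf(0.975)        # quantile (role of norm.ppf)

print("draws:", draws)
print("pdf(0):", density_at_0, "cdf(0):", prob_below_0, "97.5% quantile:", z_975)

# Simulation check of Var = (b - a)^2 / 12 for Uniform[a, b].
a, b = 2.0, 8.0
rng = random.Random(0)
uniform_draws = [rng.uniform(a, b) for _ in range(50_000)]
simulated_var = statistics.pvariance(uniform_draws)
theoretical_var = (b - a) ** 2 / 12

print("simulated:", simulated_var, "theoretical:", theoretical_var)
```

The simulated variance only approximates the theoretical value; the gap shrinks as the number of draws grows.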
Question 28. The standard error of the mean for a sample of size n with sample standard deviation s is:
A) s / √n
B) s × √n
C) s / n
D) s × n
Answer: A
Explanation: SE = s / sqrt(n).

Question 29. In a 95 % confidence interval for a population mean, the margin of error is:
A) t_{α/2, df} · SE
B) z_{α/2} · SE
C) Both A and B depending on known σ
D) None of the above
Answer: C
Explanation: Use z when σ known, t when σ unknown.

Question 30. The law of large numbers guarantees that as sample size increases:
A) Sample variance equals population variance
B) Sample mean converges to the population mean
C) Sample median converges to the mode
D) Sample distribution becomes uniform
Answer: B
Explanation: LLN states convergence of sample average to expected value.

Question 31. The null hypothesis in a one‑sample t‑test typically states that:
A) Population mean equals a specified value
B) Population mean differs from a specified value
C) Population variance is zero
D) Data are not normally distributed
Answer: A
Explanation: H₀: μ = μ₀; test assesses evidence against this equality.

Question 32. When the population standard deviation is known and n ≥ 30, which test is appropriate for a mean hypothesis?
A) One‑sample Z‑test
B) One‑sample t‑test
C) Paired t‑test
D) Wilcoxon signed‑rank test
Answer: A
Explanation: Known σ and large n justify Z‑test.

Question 33. A two‑sample independent t‑test assumes:
A) Equal variances only
B) Independent samples and (often) equal variances
C) Paired observations
D) Categorical outcome
Answer: B
Explanation: Independence of groups; equal variances may be assumed (pooled) or not (Welch).

Question 34. In ANOVA, the F‑statistic is the ratio of:
A) Between‑group variance to within‑group variance
C) Sample size is less than 30
D) Data are skewed
Answer: A
Explanation: Correlation involves division by product of standard deviations; zero variance leads to division by zero.

Question 38. Spearman’s rank correlation is appropriate when:
A) Both variables are nominal
B) Relationship is monotonic but not necessarily linear
C) Data are normally distributed
D) Variables have equal variances
Answer: B
Explanation: It measures monotonic association using ranks.

Question 39. In simple linear regression, the slope coefficient β₁ represents:
A) Change in Y for a one‑unit change in X
B) Intercept of the regression line
C) Correlation between X and Y
D) Standard error of estimate
Answer: A
Explanation: β₁ quantifies the expected change in response per unit increase in predictor.

Question 40. The coefficient of determination (R²) indicates:
A) Proportion of variance in Y explained by X
B) Correlation between X and Y
C) Standard error of the slope
D) Probability of Type I error
Answer: A
Explanation: R² = 1 – (SS_res/SS_tot), measuring explained variance.

Question 41. Multicollinearity in multiple regression inflates:
A) Standard errors of coefficients
B) R² value
C) Intercept estimate
D) Sample size
Answer: A
Explanation: Correlated predictors make coefficient estimates unstable, increasing SEs.

Question 42. In logistic regression, the link function is:
A) Identity
B) Logit (log‑odds)
C) Probit
D) Exponential
Answer: B
Explanation: Logistic model uses logit link: log(p/(1‑p)) = β₀+β₁X.

Question 43. The AUC (Area Under the ROC Curve) measures:
A) Model’s ability to discriminate between classes
B) Proportion of variance explained
C) Calibration of predicted probabilities
D) Number of predictors used
D) Increase model complexity
Answer: B
Explanation: KNN relies on Euclidean distance; scaling puts features on comparable units.

Question 47. In PCA, the principal components are ordered by:
A) Increasing variance explained
B) Decreasing variance explained
C) Alphabetical order of variable names
D) Random selection
Answer: B
Explanation: First component captures most variance, subsequent components capture decreasing amounts.

Question 48. The silhouette score is used to evaluate:
A) Regression model accuracy
B) Clustering cohesion and separation
C) Classification recall
D) Time‑series stationarity
Answer: B
Explanation: Silhouette measures how similar an object is to its own cluster vs. other clusters.

Question 49. A time series that shows a consistent upward trend but no seasonality is said to be:
A) Stationary
B) Seasonal
C) Trend‑stationary
D) Non‑stationary
Answer: D
Explanation: Presence of a deterministic trend violates stationarity.

Question 50. In ARIMA(p,d,q), the parameter d represents:
A) Number of autoregressive terms
B) Number of differencing operations to achieve stationarity
C) Number of moving‑average terms
D) Seasonal period
Answer: B
Explanation: d is the order of integration (differences).

Question 51. The p‑value of a hypothesis test is:
A) Probability that H₀ is true
B) Probability of observing data at least as extreme as those observed, assuming H₀ is true
C) Significance level α
D) Probability of a Type II error
Answer: B
Explanation: The p‑value quantifies the evidence against H₀ and is computed under the assumption that H₀ is true.

Question 52. Which of the following is a correct interpretation of a 99 % confidence interval?
A) There is a 99 % probability that the true parameter lies inside the interval
B) 99 % of future samples will produce intervals that contain the true parameter
C) The interval will contain 99 % of the data points
D) The interval width is 99 % of the sample mean
Explanation: t‑test is used when σ is unknown; known σ would lead to Z‑test.

Question 56. In a paired t‑test, the test statistic is computed on:
A) Raw scores of each group
B) Differences between paired observations
C) Sum of squares of each group
D) Ratio of variances
Answer: B
Explanation: Paired test analyzes the mean of the differences.

Question 57. The chi‑square goodness‑of‑fit test compares observed frequencies to:
A) Expected frequencies under a specified distribution
B) Frequencies of another variable
C) Median values
D) Means of continuous data
Answer: A
Explanation: It evaluates how well data follow a hypothesized distribution.

Question 58. The F‑distribution is used in ANOVA because:
A) It models the ratio of two independent chi‑square variables divided by their degrees of freedom
B) It is symmetric about zero
C) It has only one parameter
D) It is discrete
Answer: A
Explanation: F = (SS_between/df1)/(SS_within/df2) follows an F‑distribution when H₀ is true.
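The paired t‑test in Question 56 reduces to a one‑sample test on the differences. The sketch below computes the test statistic by hand with the standard library; the before/after scores are hypothetical, and obtaining a p‑value would additionally require the t‑distribution (for example via `scipy.stats.ttest_rel`), which is left out here.

```python
import math
import statistics

# Hypothetical paired measurements (same five subjects, before and after).
before = [72, 75, 78, 80, 69]
after = [75, 78, 79, 85, 70]

# The test works on the per-pair differences, not on the raw scores.
differences = [a - b for a, b in zip(after, before)]

n = len(differences)
mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)      # sample standard deviation (N - 1)
standard_error = sd_diff / math.sqrt(n)

t_stat = mean_diff / standard_error          # compare against t with n - 1 df
degrees_of_freedom = n - 1

print("differences:", differences)
print("t statistic:", t_stat, "with df =", degrees_of_freedom)
```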
Question 59. Which metric is appropriate for evaluating a regression model’s predictive accuracy?
A) Accuracy
B) Mean Squared Error (MSE)
C) ROC AUC
D) Confusion matrix
Answer: B
Explanation: MSE measures average squared difference between predicted and actual values.

Question 60. In the context of hypothesis testing, a “two‑tailed” test is used when:
A) The alternative hypothesis specifies a direction
B) The alternative hypothesis only states inequality (≠)
C) Only one side of the distribution is of interest
D) Sample size is small
Answer: B
Explanation: Two‑tailed tests consider deviations in both directions from H₀.

Question 61. The bootstrap method is primarily used to:
A) Increase sample size by duplicating data
B) Estimate the sampling distribution of a statistic by resampling with replacement
C) Perform parametric tests on non‑normal data
D) Reduce dimensionality
Answer: B
Explanation: Bootstrap creates many resamples to approximate the distribution of an estimator.
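The resampling idea in Question 61 can be sketched in a few lines. This example builds a percentile bootstrap interval for the mean; the sample values, the seed, and the choice of 2,000 resamples are assumptions made for illustration, and the percentile method is only one of several ways to form a bootstrap interval.

```python
import random
import statistics

# Hypothetical observed sample (illustrative values only).
sample = [12, 15, 9, 20, 14, 17, 11, 13, 16, 18]

rng = random.Random(0)      # fixed seed so the run is reproducible
n_resamples = 2000

# Each resample draws len(sample) values WITH replacement from the sample.
boot_means = sorted(
    statistics.fmean(rng.choices(sample, k=len(sample)))
    for _ in range(n_resamples)
)

# Percentile interval: the middle 95 % of the bootstrap means.
lower = boot_means[int(0.025 * n_resamples)]
upper = boot_means[int(0.975 * n_resamples) - 1]

print("sample mean:", statistics.fmean(sample))
print("approximate 95% bootstrap interval:", (lower, upper))
```

The sorted list `boot_means` approximates the sampling distribution of the mean, so other summaries (such as a bootstrap standard error via `statistics.stdev(boot_means)`) can be read off the same resamples.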