




















































































The Data Science Associate (DSA) Certification Exam Preparation Guide introduces learners to the foundational concepts of data science. It covers data collection, data cleaning, exploratory data analysis, basic statistics, machine learning fundamentals, and data visualization techniques. The guide also touches on tools, programming concepts, and ethical considerations. Designed for beginners and early-career professionals, it offers step-by-step explanations, examples, and practice questions to build confidence and prepare candidates for the certification exam and entry-level data science roles.
Typology: Exams





















































































Question 1. Which of the following best describes the purpose of Exploratory Data Analysis (EDA)?
A) To train a predictive model
B) To clean data by removing duplicates
C) To summarize main characteristics of a dataset and discover patterns
D) To deploy a model into production
Answer: C
Explanation: EDA involves using summary statistics and visualizations to understand data, detect anomalies, and formulate hypotheses before modeling.

Question 2. When handling missing values, which technique is most appropriate for a categorical variable with many levels?
A) Mean imputation
B) Deleting rows with missing values
C) Imputing with the mode
D) Using a predictive model for imputation
Answer: D
Explanation: For high-cardinality categorical variables, predictive imputation (e.g., using a classifier) preserves information better than simple mode replacement.

Question 3. Which method is used to detect outliers based on the interquartile range (IQR)?
A) Z-score > 3
B) Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
C) Mahalanobis distance > 5
D) DBSCAN labeling as noise
Answer: B
Explanation: The IQR rule flags points that lie more than 1.5*IQR below the first quartile or more than 1.5*IQR above the third quartile as outliers.

Question 4. What is the effect of Min-Max scaling on a feature?
A) Centers data around zero with unit variance
B) Transforms data to a specified range, usually [0,1]
C) Removes correlation with other features
D) Imposes a logarithmic transformation
Answer: B
Explanation: Min-Max scaling linearly rescales values so the minimum becomes 0 and the maximum becomes 1 (or the endpoints of any chosen range).

Question 5. Which encoding technique is most suitable for an ordinal categorical variable (e.g., Low, Medium, High)?
A) One-Hot Encoding
B) Label Encoding
C) Target Encoding
D) Binary Encoding
Answer: B
Explanation: Label encoding preserves the intrinsic order of ordinal categories when the integers are assigned in category order (e.g., Low = 0, Medium = 1, High = 2).

Question 6. In SQL, which join returns all rows from the left table and matching rows from the right table, filling with NULLs when there is no match?
A) INNER JOIN
B) LEFT OUTER JOIN
Question 9. Which probability distribution models the number of successes in a fixed number of independent Bernoulli trials?
A) Poisson
B) Binomial
C) Exponential
D) Uniform
Answer: B
Explanation: The Binomial distribution counts successes over a set number of trials with constant success probability.

Question 10. Bayes' Theorem is primarily used to:
A) Compute confidence intervals
B) Update prior probabilities with new evidence
C) Test for independence between variables
D) Estimate population variance
Answer: B
Explanation: Bayes' Theorem combines prior belief and likelihood of observed data to produce a posterior probability.

Question 11. According to the Central Limit Theorem, the sampling distribution of the sample mean approaches a normal distribution as the sample size:
A) Decreases
B) Increases beyond 30
C) Equals the population size
D) Remains unchanged
Answer: B
Explanation: With sufficiently large samples (n ≥ 30 is a common rule of thumb), the distribution of the sample mean becomes approximately normal regardless of the underlying population shape, provided the population has finite variance.

Question 12. In hypothesis testing, a Type I error occurs when:
A) The null hypothesis is incorrectly rejected
B) The null hypothesis is incorrectly not rejected
C) The alternative hypothesis is true but not tested
D) The sample size is too small
Answer: A
Explanation: A Type I error is a false positive: rejecting a true null hypothesis.

Question 13. Which test is appropriate for comparing means of more than two independent groups?
A) Paired t-test
B) One-way ANOVA
C) Chi-square test
D) Mann-Whitney U test
Answer: B
Explanation: One-way ANOVA assesses whether at least one group mean differs among three or more independent groups.

Question 14. Pearson’s correlation coefficient measures:
A) Linear relationship between two continuous variables
B) Rank-based association
C) Causal effect
B) Lasso (L1)
C) Elastic Net
D) Dropout
Answer: B
Explanation: Lasso (L1) regularization shrinks some coefficients to exactly zero, performing variable selection.

Question 18. Logistic regression outputs:
A) Continuous values between 0 and 1 interpreted as probabilities
B) Class labels directly
C) Decision boundaries only
D) Cluster assignments
Answer: A
Explanation: Logistic regression models the log-odds and converts them to probabilities via the sigmoid function.

Question 19. The k-Nearest Neighbors algorithm is sensitive to:
A) Feature scaling
B) Number of trees
C) Learning rate
D) Regularization strength
Answer: A
Explanation: Because distance calculations drive k-NN, unscaled features with large ranges can dominate the metric and bias which neighbors are selected.

Question 20. Which kernel function allows an SVM to create non-linear decision boundaries by mapping data into higher dimensions?
A) Linear kernel
B) Polynomial kernel
C) RBF (Gaussian) kernel
D) Both B and C
Answer: D
Explanation: Both polynomial and radial basis function (RBF) kernels enable SVMs to capture non-linear patterns.

Question 21. Naïve Bayes classifiers assume:
A) Features are dependent
B) Features are conditionally independent given the class
C) Linear separability of classes
D) Equal class priors
Answer: B
Explanation: The “naïve” assumption is that each feature contributes independently to the probability of a class.

Question 22. Which tree-based algorithm builds models sequentially, where each new tree corrects errors of the previous ensemble?
A) Decision Tree
B) Random Forest
C) Gradient Boosting
D) CART
Answer: C
Explanation: Gradient Boosting adds trees iteratively, each focusing on residuals of the combined previous trees.
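The residual-fitting idea in Question 22 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production implementation: it uses squared-error loss, one-split trees (stumps) on a single feature, and an invented toy dataset. The function names (`fit_stump`, `gradient_boost`, `boosted_predict`, `mse`) and the default `n_trees` and `learning_rate` values are hypothetical choices for this example.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]  # (threshold, left_value, right_value)


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def gradient_boost(xs, ys, n_trees=50, learning_rate=0.3):
    """Squared-error boosting: each stump is fit to the current ensemble's residuals."""
    base = mean(ys)  # the initial prediction is the mean of the targets
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Add a damped correction from the new stump to the running predictions.
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def boosted_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mse(model, xs, ys):
    return mean((y - boosted_predict(model, x)) ** 2 for x, y in zip(xs, ys))


# Invented toy data: a noisy step function.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

baseline = gradient_boost(xs, ys, n_trees=0)  # predicts the mean only
boosted = gradient_boost(xs, ys, n_trees=50)
print(mse(baseline, xs, ys), mse(boosted, xs, ys))
```

Comparing the two printed errors shows the training error falling as stumps are added. A Random Forest, by contrast, would fit each tree independently to the original targets rather than to the residuals of the ensemble so far.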
Explanation: DBSCAN groups points based on density, allowing detection of non-convex shapes and labeling low-density points as noise.

Question 26. The “elbow method” is used to determine:
A) Optimal number of principal components
B) Optimal number of clusters in K-Means
C) Best regularization parameter for Lasso
D) Ideal learning rate for gradient descent
Answer: B
Explanation: Plotting within-cluster sum of squares versus K and locating the “elbow” helps select an appropriate K.

Question 27. Principal Component Analysis (PCA) is primarily used for:
A) Supervised classification
B) Reducing dimensionality while preserving variance
C) Handling missing values
D) Creating synthetic features
Answer: B
Explanation: PCA transforms correlated variables into orthogonal components that capture maximal variance.

Question 28. t-Distributed Stochastic Neighbor Embedding (t-SNE) is most suitable for:
A) Large-scale linear regression
B) Visualizing high-dimensional data in 2-D or 3-D
C) Time-series forecasting
D) Hyperparameter optimization
Answer: B
Explanation: t-SNE preserves local structure, making it ideal for visualizing complex high-dimensional datasets.

Question 29. Which metric is appropriate for evaluating a binary classifier when the classes are highly imbalanced?
A) Accuracy
B) Precision-Recall AUC
C) R²
D) Mean Absolute Error
Answer: B
Explanation: Precision-Recall AUC focuses on the minority class performance and is more informative than accuracy in imbalanced settings.

Question 30. A confusion matrix element representing false positives is:
A) True Positive (TP)
B) True Negative (TN)
C) False Positive (FP)
D) False Negative (FN)
Answer: C
Explanation: FP counts instances where the model predicts the positive class incorrectly.

Question 31. In k-fold cross-validation, if k = 5, how many distinct training sets are created?
A) 1
A) Grid Search explores a random subset of hyperparameter space
B) Grid Search evaluates every combination on a predefined grid
C) Random Search is deterministic
D) Grid Search can only handle continuous parameters
Answer: B
Explanation: Grid Search exhaustively tests all specified hyperparameter combinations, while Random Search samples randomly.

Question 35. Early stopping is a technique used to:
A) Reduce the number of features before training
B) Halt training when validation performance stops improving
C) Increase learning rate automatically
D) Convert a regression model to classification
Answer: B
Explanation: Early stopping prevents overfitting by terminating training once validation error ceases to improve.

Question 36. Which plot is most effective for visualizing the distribution of a single continuous variable?
A) Scatter plot
B) Histogram
C) Box plot
D) Bar chart
Answer: B
Explanation: Histograms show frequency of value ranges, directly depicting distribution shape.
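To make the histogram idea in Question 36 concrete, the sketch below counts how many values fall into each equal-width bin, which is the computation behind the bars of a histogram. It is only an illustration: the function name `histogram_counts`, the choice of four bins, and the sample values are assumptions made for this example, and plotting libraries apply their own binning rules.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are identical; bin width would be zero")
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(n_bins + 1)]
    return edges, counts


# Invented sample: most values are small, with one large value on the right.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = histogram_counts(values, n_bins=4)

# Text rendering: one row per bin, one '#' per observation.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left:4.1f}, {right:4.1f})  {'#' * count}")
```

For this sample the counts are `[3, 5, 1, 1]`: the tall second bar and the short bars to the right reveal a right-skewed shape, which a scatter plot or bar chart of raw values would not summarize as directly.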
Question 37. A box plot displays all of the following EXCEPT:
A) Median
B) Interquartile range
C) Mean
D) Outliers
Answer: C
Explanation: Standard box plots typically show median, quartiles, and potential outliers, but not the mean.

Question 38. Which visualization is best for comparing categorical frequencies across multiple groups?
A) Stacked bar chart
B) Scatter plot
C) Heat map
D) Line chart
Answer: A
Explanation: Stacked bars show the composition of categories within each group, facilitating comparison.

Question 39. In Tableau, a “dashboard” is primarily used to:
A) Write SQL queries
B) Combine multiple visualizations into a single interactive view
C) Perform data cleaning
D) Deploy machine-learning models
Answer: B
Explanation: Dashboards aggregate several worksheets, allowing interactive exploration and storytelling.
Explanation: Balanced, diverse datasets reduce the risk that the model learns prejudiced patterns.

Question 43. Which of the following is a best practice when handling Personally Identifiable Information (PII)?
A) Store PII in plain text for easy access
B) Encrypt PII at rest and in transit
C) Share PII publicly for transparency
D) Delete PII after 24 hours regardless of need
Answer: B
Explanation: Encryption protects PII from unauthorized access both while stored and during transmission.

Question 44. In a time-series forecasting problem, which evaluation metric is most appropriate?
A) R²
B) Mean Absolute Percentage Error (MAPE)
C) ROC-AUC
D) Silhouette Score
Answer: B
Explanation: MAPE expresses forecast error as a percentage, making it intuitive for time-dependent predictions.

Question 45. Which Python library provides the function train_test_split?
A) pandas
B) numpy
C) scikit-learn
D) matplotlib
Answer: C
Explanation: train_test_split is part of scikit-learn’s model_selection module.

Question 46. The “curse of dimensionality” refers to:
A) Overfitting caused by too many observations
B) Computational inefficiency when features are too many relative to samples
C) Difficulty visualizing data in 2-D
D) Increased interpretability with more features
Answer: B
Explanation: As dimensionality grows, data becomes sparse, distance metrics lose meaning, and models need exponentially more data.

Question 47. Which technique can be used to reduce the impact of multicollinearity without dropping variables?
A) Standardization
B) Principal Component Analysis
C) One-Hot Encoding
D) Imputation
Answer: B
Explanation: PCA transforms correlated features into uncorrelated (orthogonal) components, so a model fitted on those components no longer suffers from multicollinearity.

Question 48. In a binary classification problem, the threshold that maximizes Youden’s J statistic is:
A) The point where sensitivity = specificity
B) The point where precision = recall
A) Mean Squared Error
B) Hinge loss
C) Binary Cross-Entropy (Log Loss)
D) Poisson loss
Answer: C
Explanation: Binary cross-entropy measures the discrepancy between predicted probabilities and actual binary outcomes.

Question 52. In a SQL query, which clause filters rows after aggregation?
A) WHERE
B) GROUP BY
C) HAVING
D) ORDER BY
Answer: C
Explanation: HAVING applies conditions to grouped results, whereas WHERE filters before aggregation.

Question 53. Which of the following is a non-parametric test for comparing two independent distributions?
A) Student’s t-test
B) Paired t-test
C) Mann-Whitney U test
D) ANOVA
Answer: C
Explanation: Mann-Whitney U does not assume normality and compares ranks between two groups.
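As a rough illustration of the rank-based comparison in Question 53, the sketch below computes the Mann-Whitney U statistics by counting, over every cross-sample pair, how often a value from one sample exceeds a value from the other (ties count as one half). It computes the statistic only, not a p-value; in practice a library routine such as `scipy.stats.mannwhitneyu` would be used for the full test. The function name `mann_whitney_u` and the sample data are invented for this example.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b), the Mann-Whitney U statistics for two independent samples.

    U_a counts the pairs (a, b) in which a is larger than b, with ties
    counting one half. Only the ordering of values matters, not their
    magnitudes, which is why no normality assumption is needed.
    """
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    # Every pair is won by one side or split, so the two statistics sum to n_a * n_b.
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b


# Invented data: complete separation gives the most extreme statistic.
separated = mann_whitney_u([1, 2, 3], [4, 5, 6])
# Invented data: interleaved samples with one tie give a balanced statistic.
interleaved = mann_whitney_u([1, 3, 5], [2, 3, 4])
print(separated, interleaved)
```

Here the separated samples give `(0.0, 9.0)` and the interleaved samples give `(4.5, 4.5)`. A U value near 0 or near `n_a * n_b` suggests the two distributions differ, while a value near half of `n_a * n_b` suggests they overlap.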
Question 54. The “lift” metric in marketing analytics measures:
A) Increase in model training speed
B) Improvement of model over random targeting
C) Reduction in data storage cost
D) Number of missing values handled
Answer: B
Explanation: Lift quantifies how much better a predictive model performs compared to random selection.

Question 55. Which of the following statements about the ROC curve is true?
A) It plots precision vs. recall
B) It is insensitive to class imbalance
C) It requires a probability threshold of 0.
D) It cannot be used for multiclass problems
Answer: B
Explanation: ROC curves plot true positive rate against false positive rate across thresholds. Because each rate is computed within a single class, the curve does not shift when class proportions change, although it can look optimistic on highly imbalanced data, where Precision-Recall AUC is more informative.

Question 56. In time-series decomposition, the “trend” component represents:
A) Seasonal patterns repeating each period
B) Random noise
C) Long-term direction of the series
D) Autocorrelation structure
Answer: C