Kizen Data Scientist Practice Exam, Exams of Technology

This exam assesses advanced statistical, machine learning, and data engineering skills. Areas include supervised/unsupervised learning, Python/R programming, model evaluation, feature engineering, neural networks, big-data technologies, and deployment strategies. Real-world case problems cover handling messy datasets, building predictive models, applying algorithms, A/B testing, and developing data-driven business solutions across industries.

Typology: Exams

2025/2026

Available from 01/07/2026

shilpi-jain-1

Kizen Data Scientist Practice Exam
**Question 1. Which of the following best describes the primary distinction between Artificial
Intelligence (AI) and Machine Learning (ML)?**
A) AI is a subset of ML.
B) ML is a subset of AI.
C) Both are identical concepts.
D) AI focuses only on robotics.
Answer: B
Explanation: Machine Learning is a specific approach within the broader field of Artificial Intelligence
that enables systems to learn from data.
**Question 2. In the context of a data-driven organization, which role is most responsible for translating
business objectives into analytical tasks?**
A) Data Engineer
B) Data Analyst
C) Data Scientist
D) Database Administrator
Answer: C
Explanation: Data Scientists bridge the gap between business problems and analytical solutions by
formulating hypotheses, selecting models, and interpreting results.
**Question 3. The Data and Analytics Maturity Framework’s “Optimized” stage is characterized by
which of the following?**
A) Ad-hoc reporting
B) Centralized data governance
C) Predictive analytics embedded in business processes
D) Manual data collection
Answer: C
Explanation: At the Optimized stage, organizations use predictive and prescriptive analytics operationally, integrating insights into decision-making.
**Question 4. Which Kaizen principle emphasizes continuous, incremental improvement rather than large, disruptive changes?**
A) Breakthrough Innovation
B) Incremental Change
C) Radical Transformation
D) One-time Optimization
Answer: B
Explanation: Kaizen focuses on small, ongoing improvements that cumulatively lead to significant performance gains.
**Question 5. Which of the following is NOT a typical component of data governance?**
A) Data lineage tracking
B) Model hyper-parameter tuning
C) Data quality standards
D) Access control policies
Answer: B
Explanation: Model hyper-parameter tuning is a modeling activity, whereas data governance deals with data management policies, quality, and security.
**Question 6. In linear algebra, the eigenvalue of a matrix is best defined as:**
A) The sum of the matrix’s diagonal elements.
B) A scalar λ such that Av = λv for some non-zero vector v.
C) The determinant of the matrix.

D) Standard deviation
Answer: B
Explanation: The median is robust to extreme values and better represents central tendency in skewed data.
**Question 10. The variance of a dataset can be interpreted as:**
A) The average squared deviation from the mean.
B) The square root of the standard deviation.
C) The sum of absolute deviations.
D) The range divided by two.
Answer: A
Explanation: Variance quantifies dispersion by averaging the squared differences between each observation and the mean (the sample variance divides by n - 1 rather than n).
**Question 11. Which probability distribution is most appropriate for modeling the number of successes in a fixed number of independent Bernoulli trials?**
A) Poisson
B) Normal
C) Binomial
D) Exponential
Answer: C
Explanation: The Binomial distribution describes counts of successes across a fixed number of trials with constant success probability.
**Question 12. Bayes’ theorem is primarily used to:**
A) Compute the probability of the union of events.
B) Update prior probabilities with new evidence.
C) Determine the expected value of a random variable.
D) Approximate integrals.
Answer: B
Explanation: Bayes’ theorem provides a framework for revising prior beliefs in light of observed data: the posterior is proportional to the prior multiplied by the likelihood.
**Question 13. The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size:**
A) Decreases to zero.
B) Increases, regardless of the population distribution.
C) Remains constant.
D) Depends on the population variance only.
Answer: B
Explanation: With sufficiently large samples, the distribution of the sample mean becomes approximately normal even if the underlying population is not normal, provided the population has finite variance.
**Question 14. A 95% confidence interval for a population mean implies that:**
A) There is a 95% probability the true mean lies within the interval.
B) 95% of sample means will fall inside the interval.
C) If we repeated the experiment many times, 95% of the constructed intervals would contain the true mean.
D) The interval contains 95% of the data points.
Answer: C
Explanation: Confidence intervals reflect long-run frequency: 95% of such intervals would capture the true parameter.
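
The long-run reading of a confidence interval in Question 14 can be checked with a small simulation. The sketch below is illustrative only: the exponential population, the sample size of 50, the seed, and the function name `coverage` are all assumptions chosen for the example, and the interval uses the normal approximation (critical value 1.96) that the Central Limit Theorem in Question 13 motivates.

```python
import math
import random
import statistics


def coverage(n_experiments=2000, sample_size=50, true_mean=2.0, seed=0):
    """Return the fraction of approximate 95% intervals that contain true_mean."""
    rng = random.Random(seed)
    z = 1.96  # two-sided 95% critical value of the standard normal
    hits = 0
    for _ in range(n_experiments):
        # Skewed (exponential) population whose mean equals true_mean.
        sample = [rng.expovariate(1 / true_mean) for _ in range(sample_size)]
        mean = statistics.fmean(sample)
        half_width = z * statistics.stdev(sample) / math.sqrt(sample_size)
        if mean - half_width <= true_mean <= mean + half_width:
            hits += 1
    return hits / n_experiments


print(f"Fraction of intervals containing the true mean: {coverage():.3f}")
```

Under these assumptions the printed fraction should land near 0.95, and with a skewed population and a modest sample size it may fall slightly below. Any single interval either contains the true mean or does not; the 95% describes the procedure across repetitions.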

**Question 18. In Python, which library provides the DataFrame data structure most commonly used for data manipulation?**
A) NumPy
B) SciPy
C) Pandas
D) Matplotlib
Answer: C
Explanation: Pandas introduces the DataFrame, a tabular data structure with powerful indexing and manipulation capabilities.
**Question 19. Which NumPy function creates an array of evenly spaced values over a specified interval?**
A) np.arange()
B) np.linspace()
C) np.random.rand()
D) np.zeros()
Answer: B
Explanation: np.linspace(start, stop, num) returns ‘num’ evenly spaced samples between start and stop, including the endpoint by default. np.arange() also produces evenly spaced values, but it is parameterized by step size rather than by the number of samples.
**Question 20. In Pandas, the method used to combine two DataFrames horizontally based on a common key is:**
A) concat()
B) merge()
C) join()
D) pivot()
Answer: B
Explanation: merge() performs database-style joins (inner, left, right, outer) on one or more key columns. join() also combines columns, but it aligns on the index by default.
**Question 21. Which SQL clause is used to filter rows after aggregation?**
A) WHERE
B) GROUP BY
C) HAVING
D) ORDER BY
Answer: C
Explanation: HAVING applies conditions to aggregated results, whereas WHERE filters before aggregation.
**Question 22. In a relational database, a primary key must be:**
A) Nullable.
B) Unique and non-null.
C) A foreign key.
D) Composite by default.
Answer: B
Explanation: Primary keys uniquely identify rows and cannot contain NULL values.
**Question 23. Which NoSQL database type is best suited for storing hierarchical JSON documents?**
A) Column-family store
B) Graph database
C) Document store
D) Key-value store
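
The WHERE versus HAVING distinction from Question 21 is easier to see on a concrete table. The following is a minimal sketch using Python's built-in `sqlite3` module; the `orders` table, its columns, and the sample rows are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory database with a small, made-up orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 60.0), ("ann", 60.0), ("bob", 30.0), ("bob", 20.0), ("cy", 200.0)],
)

# WHERE filters individual rows before they are grouped:
# only single orders of at least 100 survive, so ann's two orders of 60 are dropped.
where_rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "WHERE amount >= 100 GROUP BY customer ORDER BY customer"
).fetchall()

# HAVING filters whole groups after aggregation:
# ann's total of 120 now passes the threshold.
having_rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer HAVING SUM(amount) >= 100 ORDER BY customer"
).fetchall()

print(where_rows)   # [('cy', 200.0)]
print(having_rows)  # [('ann', 120.0), ('cy', 200.0)]
conn.close()
```

The same threshold gives different answers because it is applied at different stages: to rows in the first query and to group totals in the second.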

C) Replacing with zero.
D) Using forward fill.
Answer: B
Explanation: Mean imputation is simple and leaves the estimated mean unbiased when data are missing completely at random, although it understates the variance.
**Question 27. Outlier detection using the Interquartile Range (IQR) classifies a point as an outlier if it lies beyond:**
A) 1.5 × IQR below Q1 or above Q3.
B) 2 × IQR below Q1 or above Q3.
C) 3 × standard deviation from the mean.
D) The median ± IQR.
Answer: A
Explanation: The common IQR rule flags values below Q1 - 1.5·IQR or above Q3 + 1.5·IQR as outliers.
**Question 28. Feature scaling that transforms data to have zero mean and unit variance is called:**
A) Min-max scaling.
B) Log transformation.
C) Standardization.
D) Binning.
Answer: C
Explanation: Standardization (z-score scaling) subtracts the mean and divides by the standard deviation.
**Question 29. Which of the following is a dimensionality-reduction technique that preserves maximum variance?**
A) Linear Discriminant Analysis (LDA)
B) Principal Component Analysis (PCA)
C) t-Distributed Stochastic Neighbor Embedding (t-SNE)
D) Independent Component Analysis (ICA)
Answer: B
Explanation: PCA finds orthogonal axes (principal components) that capture the greatest variance in the data.
**Question 30. In K-Means clustering, the algorithm converges when:**
A) All points are assigned to the same cluster.
B) Cluster centroids no longer move between iterations.
C) The number of clusters equals the number of data points.
D) The within-cluster sum of squares increases.
Answer: B
Explanation: Convergence is reached when centroid positions stabilize, meaning no points change cluster assignment between iterations.
**Question 31. Which evaluation metric is most appropriate for imbalanced binary classification where false negatives are more costly than false positives?**
A) Accuracy
B) Precision
C) Recall
D) ROC-AUC
Answer: C
Explanation: Recall (sensitivity) measures the proportion of actual positives correctly identified, emphasizing detection of the minority class.
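
A short worked example shows why accuracy can mislead in the situation described in Question 31. This is a sketch on made-up labels (5 positives out of 100); the helper name `classification_metrics` and both prediction vectors are assumptions for illustration, not part of any library.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Return 0.0 when a denominator is zero (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Imbalanced ground truth: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Model A never predicts the positive class.
always_negative = [0] * 100

# Model B finds 4 of the 5 positives at the cost of 10 false alarms.
screening = [1, 1, 1, 1, 0] + [1] * 10 + [0] * 85

model_a = classification_metrics(y_true, always_negative)
model_b = classification_metrics(y_true, screening)
print(model_a)  # accuracy 0.95, recall 0.0
print(model_b)  # accuracy 0.89, recall 0.8
```

Model A scores higher on accuracy while missing every positive case. When false negatives are the costly error, recall exposes that failure and accuracy hides it.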

Explanation: Bootstrap aggregating (bagging) creates diverse trees, and averaging their predictions reduces variance.
**Question 35. Gradient Boosting Machines differ from Random Forests because they:**
A) Build trees in parallel.
B) Sequentially add trees that correct previous errors.
C) Use only one tree.
D) Do not perform any form of bagging.
Answer: B
Explanation: Gradient Boosting builds trees one after another, each focusing on residuals of the ensemble so far.
**Question 36. Which regularization technique adds the absolute value of coefficients to the loss function?**
A) L1 (Lasso)
B) L2 (Ridge)
C) Elastic Net
D) Dropout
Answer: A
Explanation: L1 regularization (Lasso) penalizes the sum of absolute coefficient values, encouraging sparsity.
**Question 37. In Support Vector Machines (SVM), the kernel trick allows:**
A) Linear separation of non-linear data by mapping to higher dimensions.
B) Reduction of dataset size.
C) Automatic feature selection.
D) Direct computation of probabilities.
Answer: A
Explanation: Kernels implicitly transform data into higher-dimensional spaces where a linear separator may exist.
**Question 38. The Naïve Bayes classifier assumes which of the following about predictor variables?**
A) They are normally distributed.
B) They are conditionally independent given the class.
C) They have equal variance.
D) They are all categorical.
Answer: B
Explanation: Naïve Bayes simplifies computation by assuming conditional independence among features.
**Question 39. Which performance metric for regression is expressed as the proportion of variance explained by the model?**
A) Mean Absolute Error (MAE)
B) Root Mean Squared Error (RMSE)
C) R-squared (Coefficient of Determination)
D) Adjusted R-squared
Answer: C
Explanation: R-squared quantifies the fraction of total variance in the dependent variable accounted for by the model.
**Question 40. In time-series decomposition, the component that captures repeating patterns at fixed intervals is called:**
A) Trend

**Question 43. Recurrent Neural Networks (RNNs) are particularly suited for which type of data?**
A. Tabular data
B. Image data
C. Sequential data (e.g., text, time series)
D. Graph data
Answer: C
Explanation: RNNs maintain hidden states that capture temporal dependencies, making them well suited to sequences.
**Question 44. In Natural Language Processing, which preprocessing step converts the sentence “Data Science is awesome!” into tokens “data”, “science”, “awesome”?**
A. Stemming
B. Lemmatization
C. Stop-word removal only
D. Lowercasing, punctuation removal, and tokenization
Answer: D
Explanation: Lowercasing, removing punctuation, and tokenizing yield “data”, “science”, “is”, “awesome”; dropping “is” additionally requires stop-word removal, so D is the closest single option. Stemming or lemmatization would further modify word forms.
**Question 45. Sentiment analysis typically outputs which type of result?**
A. Continuous numeric rating from 0 to 1
B. Categorical label (e.g., positive, neutral, negative)
C. Part-of-speech tags
D. Named entities
Answer: B
Explanation: Sentiment analysis classifies text into sentiment categories, often positive, neutral, or negative.
**Question 46. Which visualization is most appropriate for comparing the distribution of a numeric variable across several categorical groups?**
A. Scatter plot
B. Box plot
C. Line chart
D. Pie chart
Answer: B
Explanation: Box plots display median, quartiles, and outliers for each category, facilitating distribution comparison.
**Question 47. In Matplotlib, the function plt.hist() is used to create:**
A. A bar chart
B. A histogram
C. A line plot
D. A scatter plot
Answer: B
Explanation: hist() computes and displays the frequency distribution of a numeric variable as a histogram.
**Question 48. Which of the following best describes a “dashboard” in a business intelligence context?**
A. A static PDF report.
B. An interactive collection of visualizations that provide real-time insights.
C. A single scatter plot.

B. Share the host OS kernel.
C. Require hypervisors.
D. Cannot be version-controlled.
Answer: B
Explanation: Containers isolate applications but use the host operating system’s kernel, making them lightweight compared to VMs.
**Question 52. In MLOps, continuous integration (CI) for a model pipeline typically includes:**
A. Manual model training.
B. Automated testing of code, data schemas, and model performance.
C. Deploying directly to production without testing.
D. Ignoring version control.
Answer: B
Explanation: CI automates verification steps (unit tests, data validation, performance checks) to ensure reliability before deployment.
**Question 53. Which SQL statement would you use to permanently remove a table named sales_data?**
A. DELETE FROM sales_data;
B. DROP TABLE sales_data;
C. TRUNCATE sales_data;
D. ALTER TABLE sales_data;
Answer: B
Explanation: DROP TABLE deletes the table definition and all its data from the database. DELETE and TRUNCATE remove rows but leave the table itself in place.
**Question 54. A “foreign key” constraint enforces which of the following?**
A. Uniqueness of values within the same table.
B. That values in a column must exist as primary key values in another table.
C. That a column cannot contain NULLs.
D. Automatic indexing of the column.
Answer: B
Explanation: Foreign keys maintain referential integrity by ensuring referenced values exist in the parent table.
**Question 55. In a Hadoop ecosystem, which component is responsible for resource management and job scheduling?**
A. HDFS
B. MapReduce
C. YARN
D. Hive
Answer: C
Explanation: YARN (Yet Another Resource Negotiator) allocates cluster resources and schedules tasks.
**Question 56. Which Spark transformation is lazy and only executed when an action is called?**
A. collect()
B. map()
C. reduce()
D. count()
Answer: B
Explanation: map() is the only transformation listed: it defines a new RDD lazily, and execution occurs when an action such as collect(), reduce(), or count() is invoked.
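
The lazy behavior described in Question 56 can be imitated in plain Python, whose built-in `map` also defers work until its result is consumed. The sketch below is an analogy only and does not use the PySpark API; the function `double` and the `calls` log are hypothetical names introduced to record when work actually happens.

```python
calls = []  # records each input at the moment the function really runs


def double(x):
    calls.append(x)
    return 2 * x


data = [1, 2, 3]

# Like a transformation: this only describes the computation.
lazy = map(double, data)
calls_before = list(calls)  # still empty, nothing has executed yet

# Like an action such as collect(): consuming the iterator triggers the work.
result = list(lazy)

print(calls_before)  # []
print(result)        # [2, 4, 6]
print(calls)         # [1, 2, 3]
```

Spark applies the same idea at cluster scale: transformations build up a plan, and only an action forces that plan to run.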