









































The DEA-7TT2 Associate Data Science Version 2.0 Exam tests foundational knowledge in data science, including statistical analysis, data modeling, and machine learning. Candidates are evaluated on their ability to work with data sets, create visualizations, and apply algorithms. This certification is ideal for professionals beginning their data science careers or seeking to validate their skills in handling and analyzing data.
Typology: Exams

Question 1: Which of the following best describes data science?
A. The study of historical events
B. The application of statistical methods to extract insights from data
C. The design of computer hardware
D. The management of financial accounts
Answer: B
Explanation: Data science focuses on using statistical and computational techniques to extract actionable insights from data.

Question 2: Data preparation in a data science project primarily involves which of the following?
A. Designing network protocols
B. Collecting and cleaning raw data
C. Writing business proposals
D. Developing mobile applications
Answer: B
Explanation: Data preparation is all about gathering, cleaning, and organizing raw data for analysis.

Question 3: Which component is NOT a key part of the data science lifecycle?
A. Model planning
B. Deployment
C. Evaluation
D. Financial auditing
Answer: D
Explanation: Financial auditing is not part of the data science lifecycle, which centers on discovery, data preparation, modeling, evaluation, deployment, and communication.

Question 4: In comparing data science and business intelligence, what is a primary difference?
A. Data science uses only qualitative data
B. Business intelligence focuses on reporting historical data
C. Data science ignores predictive modeling
D. Business intelligence involves creating algorithms
Answer: B
Explanation: Business intelligence typically emphasizes historical data reporting, while data science is more focused on predictive modeling and discovering insights.

Question 5: What is the main focus of the Discovery phase in the data science lifecycle?
A. Building the final model
B. Understanding the problem and available data
C. Deploying the model
D. Communicating results to stakeholders
Answer: B
Explanation: The Discovery phase involves understanding the business problem and identifying the relevant data to address it.
Question 6: Which role is primarily responsible for preparing and processing data in a data science team?
A. Business analyst
B. Data engineer
C. Project manager
D. Graphic designer
Answer: B
Explanation: Data engineers are tasked with gathering, cleaning, and organizing data for further analysis.

Question 7: A data science project in healthcare may focus on which KPI?
A. Website traffic
B. Patient readmission rates
C. Retail sales volume
D. Social media engagement
Answer: B
Explanation: In healthcare, KPIs like patient readmission rates are important for assessing performance and outcomes.

Question 8: Which phase of the data science lifecycle involves choosing the right model approach?
A. Data Preparation
B. Model Building
C. Model Planning
D. Communicating Results
Answer: C
Explanation: The Model Planning phase includes selecting the modeling technique and planning how to build the model.

Question 9: What is the primary objective of model evaluation?
A. To create colorful visuals
B. To determine the effectiveness of a model
C. To write code documentation
D. To deploy applications to the cloud
Answer: B
Explanation: Model evaluation assesses how well a model performs, ensuring it meets the required standards and objectives.

Question 10: Which real-world application of data science is most closely associated with predictive analytics in finance?
A. Designing logos
B. Fraud detection
C. Cooking recipes
D. Literature analysis
Answer: B
Explanation: Predictive analytics in finance often includes fraud detection by identifying unusual patterns in transactions.
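
Question 10's explanation mentions identifying unusual patterns in transactions. As a rough illustration only, the sketch below flags transaction amounts with a simple z-score rule. The z-score rule, the function name `flag_unusual`, the threshold of 2.0, and the sample amounts are all assumptions made for this example; none of them come from the exam material, and real fraud detection systems typically use far richer models.

```python
from statistics import mean, pstdev


def flag_unusual(amounts, threshold=2.0):
    """Return the amounts whose z-score magnitude exceeds the threshold.

    Hypothetical helper: a z-score rule is only one simple stand-in for
    "identifying unusual patterns"; the threshold is an assumed value.
    """
    mu = mean(amounts)
    sigma = pstdev(amounts)  # population standard deviation
    if sigma == 0:
        return []  # all amounts identical, nothing stands out
    return [a for a in amounts if abs(a - mu) / sigma > threshold]


# Made-up transaction amounts with one obvious outlier
transactions = [20, 22, 19, 21, 23, 20, 22, 500]
flagged = flag_unusual(transactions)
print(flagged)  # [500]
```

The threshold matters in small samples: with only eight values and the population standard deviation, no single point can reach a z-score above about 2.65, so the more familiar cutoff of 3 would never flag anything here.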
C. Network topology
D. API integration
Answer: B
Explanation: Box plots are effective for spotting outliers by displaying the median, quartiles, and potential extreme values.

Question 17: What does PCA (Principal Component Analysis) primarily accomplish?
A. Increases the number of features
B. Reduces the dimensionality of data
C. Cleans missing values
D. Encrypts data
Answer: B
Explanation: PCA reduces the number of features by transforming them into a new set of variables that capture most of the variance.

Question 18: In feature engineering, which method is used to transform categorical variables into numeric format?
A. Encoding
B. Normalization
C. Encryption
D. Aggregation
Answer: A
Explanation: Encoding techniques, such as one-hot encoding, convert categorical data into a numeric format suitable for modeling.

Question 19: Which of the following is NOT a measure of central tendency?
A. Mean
B. Median
C. Mode
D. Variance
Answer: D
Explanation: Variance measures dispersion, not central tendency.

Question 20: A p-value in hypothesis testing indicates which of the following?
A. The cost of the experiment
B. The probability of observing the test results under the null hypothesis
C. The average data point
D. The time required for data collection
Answer: B
Explanation: The p-value helps determine the significance of the results. It is the probability of observing results at least as extreme as those obtained, assuming the null hypothesis is true.

Question 21: Which test is most appropriate for comparing the means of two independent groups?
A. Chi-square test
B. t-test
C. ANOVA
D. Correlation analysis
Answer: B
Explanation: A t-test is used to compare the means between two independent groups.

Question 22: What does a 95% confidence interval imply?
A. The model will be 95% accurate
B. There is a 95% chance that the interval contains the true parameter
C. 95% of data points are within the interval
D. The test is repeated 95 times
Answer: B
Explanation: Of the options given, B is the closest. Strictly speaking, a 95% confidence level means that if the study were repeated many times, about 95% of the calculated intervals would contain the true parameter.

Question 23: Bayes’ Theorem is used to update the probability of an event based on which of the following?
A. The event’s historical frequency only
B. New evidence or data
C. A fixed probability value
D. The event’s popularity
Answer: B
Explanation: Bayes’ Theorem incorporates new evidence to update the probability of an event occurring.

Question 24: Which probability distribution is characterized by its bell-shaped curve?
A. Binomial distribution
B. Poisson distribution
C. Normal distribution
D. Exponential distribution
Answer: C
Explanation: The Normal distribution is famous for its bell-shaped curve, representing many natural phenomena.

Question 25: In hypothesis testing, what does it mean to reject the null hypothesis?
A. The alternative hypothesis is accepted
B. The experiment failed
C. The null hypothesis is true
D. Data is not significant
Answer: A
Explanation: Rejecting the null hypothesis supports the alternative hypothesis, suggesting that the observed effect is statistically significant.

Question 26: Pearson’s correlation coefficient is best used to measure which relationship?
A. Non-linear relationships
B. Linear relationships between two continuous variables
C. Categorical associations
D. Cause and effect definitively
Answer: B
Explanation: Pearson’s correlation measures the linear relationship between two continuous variables.
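
To make the point in Question 26 concrete, the sketch below computes Pearson's coefficient from its textbook definition (covariance divided by the product of the standard deviations) and applies it to a perfectly linear and a perfectly quadratic relationship. The function name `pearson_r` and the sample data are illustrative assumptions, not part of the exam material.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]   # y depends linearly on x
quadratic = [x ** 2 for x in xs]   # y depends on x, but not linearly

r_linear = pearson_r(xs, linear)
r_quadratic = pearson_r(xs, quadratic)
print(r_linear, r_quadratic)
```

The linear case yields a coefficient of 1, while the quadratic case yields 0 even though y is fully determined by x. That is why Pearson's coefficient is described as a measure of linear association specifically.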
B. Precision
C. Recall
D. Mean squared error
Answer: D
Explanation: Mean squared error is usually used for regression tasks, not classification.

Question 33: What is the main goal of cross-validation in model training?
A. To reduce the size of the data set
B. To assess the generalization performance of a model
C. To increase the training speed
D. To generate more features
Answer: B
Explanation: Cross-validation helps to evaluate how the outcomes of a model will generalize to an independent data set.

Question 34: What does overfitting refer to in machine learning?
A. The model is too simple
B. The model performs well on training data but poorly on new data
C. The model is perfectly generalizable
D. The data is insufficiently processed
Answer: B
Explanation: Overfitting occurs when a model captures noise along with the underlying pattern, leading to poor performance on unseen data.

Question 35: Which technique is used to prevent overfitting in regression models?
A. Data encryption
B. Regularization (L1, L2)
C. Feature scaling
D. Hyperparameter tuning only
Answer: B
Explanation: Regularization techniques such as L1 and L2 add a penalty for larger coefficients to prevent overfitting.

Question 36: What is the main idea behind ensemble methods?
A. Using a single model for prediction
B. Combining multiple models to improve prediction accuracy
C. Reducing the training data
D. Simplifying model interpretation
Answer: B
Explanation: Ensemble methods aggregate predictions from multiple models to achieve better performance than individual models.

Question 37: Which ensemble method builds multiple decision trees with bootstrapped samples?
A. Gradient Boosting
B. Bagging
C. Support Vector Machines
D. k-NN
Answer: B
Explanation: Bagging (Bootstrap Aggregating) builds several models on different random subsets of data and averages their predictions.

Question 38: What does the term “boosting” refer to in ensemble methods?
A. Combining weak learners sequentially to form a strong learner
B. Randomly selecting features
C. Clustering data points
D. Reducing computational complexity
Answer: A
Explanation: Boosting involves training weak learners sequentially, where each new learner attempts to correct errors made by the previous ones.

Question 39: In supervised learning, which algorithm is typically used for classification problems?
A. Linear regression
B. Logistic regression
C. Principal Component Analysis
D. Hierarchical clustering
Answer: B
Explanation: Logistic regression is commonly used for binary classification problems.

Question 40: Which algorithm is a non-parametric method used for classification?
A. k-Nearest Neighbors (k-NN)
B. Linear regression
C. ANOVA
D. PCA
Answer: A
Explanation: k-NN is a non-parametric algorithm that classifies data points based on the classes of their nearest neighbors.

Question 41: Which of the following best describes unsupervised learning?
A. Learning with labeled outcomes
B. Learning from data without predefined labels
C. Reinforcement of known patterns
D. Using pre-trained models
Answer: B
Explanation: Unsupervised learning finds hidden patterns or intrinsic structures in data without using labeled responses.

Question 42: k-Means clustering aims to partition data into how many clusters?
A. Two clusters
B. A predefined number k clusters
C. As many clusters as data points
D. Clusters based on decision trees
Answer: B
Explanation: k-Means clustering partitions data into a user-specified number (k) of clusters based on similarity.
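
The role of the user-specified k in Question 42 can be seen in a minimal one-dimensional sketch of the usual assign-then-update loop (often called Lloyd's algorithm). The function name `k_means_1d`, the choice to seed the centroids with the first k points, the fixed iteration count, and the sample data are all assumptions made for illustration; library implementations use more careful initialization and convergence checks.

```python
def k_means_1d(points, k, iterations=10):
    """Cluster 1-D points into k groups with a basic assign/update loop."""
    # Assumption: seed the centroids with the first k points (illustrative only)
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up data with two visibly separated groups
data = [1.0, 9.0, 2.0, 10.0, 3.0, 11.0]
centroids, clusters = k_means_1d(data, k=2)
print(centroids)  # [2.0, 10.0]
print(clusters)   # [[1.0, 2.0, 3.0], [9.0, 10.0, 11.0]]
```

Here k is an input rather than something the algorithm discovers: calling the same function with a different k would split the same six points into a different number of groups.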
B. Statistical analysis and visualization
C. Operating systems
D. Video game design
Answer: B
Explanation: R is renowned for its statistical analysis capabilities and extensive visualization libraries.

Question 49: Which tool is commonly used to create interactive data science notebooks?
A. Visual Studio Code
B. Jupyter Notebook
C. Adobe Photoshop
D. Microsoft Word
Answer: B
Explanation: Jupyter Notebooks are popular in data science for combining code, visualization, and narrative text.

Question 50: Which library is used for building machine learning models in Python?
A. Scikit-learn
B. TensorFlow only
C. Keras exclusively
D. Flask
Answer: A
Explanation: Scikit-learn is a comprehensive machine learning library in Python that provides many algorithms for classification, regression, and clustering.

Question 51: Which Python library is used primarily for numerical computations?
A. NumPy
B. Pandas
C. Matplotlib
D. Seaborn
Answer: A
Explanation: NumPy is the de facto standard third-party library in Python for numerical computations, particularly array and matrix operations.

Question 52: In R, which package is most commonly used for data manipulation?
A. dplyr
B. ggplot
C. shiny
D. caret
Answer: A
Explanation: The dplyr package in R provides a powerful grammar for data manipulation.

Question 53: What is the main benefit of writing reusable functions in data science projects?
A. They increase code redundancy
B. They allow for efficient and error-free processing of repetitive tasks
C. They decrease code readability
D. They make code execution slower
Answer: B
Explanation: Reusable functions enhance efficiency and maintainability by reducing repetition and potential errors.

Question 54: A data pipeline is used for which purpose?
A. Encrypting data
B. Automating the flow of data through different processing stages
C. Creating user interfaces
D. Writing research papers
Answer: B
Explanation: Data pipelines automate processes such as extraction, transformation, and loading (ETL) of data.

Question 55: What is the primary function of SQL in data management?
A. To visualize data
B. To query and manage relational databases
C. To build machine learning models
D. To encrypt sensitive data
Answer: B
Explanation: SQL is the standard language for querying and managing data stored in relational databases.

Question 56: Which NoSQL database is designed for high scalability and handling large volumes of data?
A. MySQL
B. MongoDB
C. Oracle
D. SQLite
Answer: B
Explanation: MongoDB is a popular NoSQL database known for its scalability and flexibility in handling large volumes of unstructured data.

Question 57: What is the primary difference between a data lake and a data warehouse?
A. Data lakes store unstructured data; data warehouses store structured data
B. Data warehouses store raw data; data lakes store processed data
C. They are exactly the same
D. Data lakes require SQL; data warehouses do not
Answer: A
Explanation: Data lakes are designed to store unstructured or semi-structured raw data, while data warehouses store cleaned and structured data for analysis.

Question 58: Hadoop’s HDFS is best described as what?
A. A type of machine learning model
B. A distributed file system for storing big data
C. A data visualization tool
D. A cloud computing platform
Answer: B
Question 64: What does SHAP provide in the context of model interpretation?
A. A method to increase model complexity
B. A framework to explain individual predictions by assigning feature contributions
C. A visualization for data pipelines
D. A data encryption standard
Answer: B
Explanation: SHAP (SHapley Additive exPlanations) helps in understanding the contribution of each feature in the prediction.

Question 65: LIME is used to:
A. Enhance the speed of data processing
B. Provide local interpretable model-agnostic explanations
C. Generate synthetic data
D. Improve network security
Answer: B
Explanation: LIME (Local Interpretable Model-agnostic Explanations) explains individual predictions regardless of the underlying model.

Question 66: Which visualization tool is popular for creating dashboards in data science?
A. Tableau
B. Notepad
C. Eclipse
D. Photoshop
Answer: A
Explanation: Tableau is widely used for creating interactive dashboards and visualizations for business insights.

Question 67: In communicating results, why is storytelling with data important?
A. It increases the size of the report
B. It makes complex data accessible to non-technical stakeholders
C. It hides the data inaccuracies
D. It replaces the need for data analysis
Answer: B
Explanation: Storytelling helps translate complex analyses into clear insights that are easily understood by all stakeholders.

Question 68: Which of the following best defines Personally Identifiable Information (PII)?
A. Data related to weather patterns
B. Information that can be used to identify a specific individual
C. Anonymous survey results
D. Public domain literature
Answer: B
Explanation: PII includes any data that could potentially identify a specific person, such as name, social security number, or email.

Question 69: Data encryption is primarily used to:
A. Speed up data processing
B. Protect sensitive data from unauthorized access
C. Create visualizations
D. Generate synthetic data
Answer: B
Explanation: Encryption secures data by converting it into a code that cannot be easily deciphered by unauthorized users.

Question 70: GDPR is a regulation primarily concerned with:
A. Financial auditing
B. Data privacy and protection in the European Union
C. Machine learning algorithms
D. Cloud deployment strategies
Answer: B
Explanation: The General Data Protection Regulation (GDPR) sets guidelines for data privacy and security within the EU.

Question 71: What is algorithmic bias?
A. A method to increase computational speed
B. A systematic error in a machine learning model that produces unfair outcomes
C. The process of encrypting algorithms
D. An unrelated financial term
Answer: B
Explanation: Algorithmic bias occurs when a model produces prejudiced results due to flawed assumptions in the machine learning process.

Question 72: Which of the following is an example of regulatory compliance in data science?
A. Using open-source software without a license
B. Adhering to HIPAA standards in healthcare data management
C. Ignoring data privacy laws
D. Deploying models without testing
Answer: B
Explanation: HIPAA (Health Insurance Portability and Accountability Act) sets standards for protecting sensitive patient data, ensuring regulatory compliance.

Question 73: What is a primary goal of data governance frameworks?
A. To centralize data and manage its quality, security, and availability
B. To develop new programming languages
C. To increase data redundancy
D. To eliminate the need for data backups
Answer: A
Explanation: Data governance frameworks help organizations manage data quality, security, and availability systematically.

Question 74: In business problem framing, what is the first step?
A. Deploying the model
B. Translating a business objective into a data science problem
C. Cleaning the data
Explanation: Feature engineering transforms raw data into features that improve the performance of machine learning models.

Question 80: In data science, what is the purpose of dimensionality reduction?
A. To increase the number of variables
B. To simplify data by reducing the number of features while retaining most of the variance
C. To encrypt data
D. To generate more missing values
Answer: B
Explanation: Dimensionality reduction techniques like PCA simplify datasets by reducing the number of variables while maintaining essential information.

Question 81: Which of the following is a key component of descriptive statistics?
A. Hypothesis testing
B. Measures of central tendency and dispersion
C. Predictive modeling
D. Neural networks
Answer: B
Explanation: Descriptive statistics summarize and describe the features of a dataset using measures such as mean, median, mode, variance, and standard deviation.

Question 82: Which test would be most suitable for comparing the means of more than two groups?
A. t-test
B. ANOVA
C. Correlation analysis
D. Regression analysis
Answer: B
Explanation: ANOVA (Analysis of Variance) is used to compare the means of three or more groups simultaneously.

Question 83: The Binomial distribution is most appropriate when:
A. Data follows a normal pattern
B. There are exactly two outcomes (success or failure) in each trial
C. Data is continuous
D. There are more than two categorical outcomes
Answer: B
Explanation: The Binomial distribution models the number of successes in a fixed number of independent binary trials.

Question 84: Which of the following is an assumption of the t-test?
A. Data must be non-normally distributed
B. The samples must be independent and normally distributed
C. The data is always categorical
D. The variance must be zero
Answer: B
Explanation: The t-test assumes that the samples are independent and come from normally distributed populations with similar variances.
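
The "similar variances" assumption in Question 84 is what allows the classic two-sample t-test to pool the two sample variances into one estimate. The sketch below computes that pooled t statistic by hand. The function name `pooled_t_statistic` and the sample values are assumptions made for illustration, and the code stops at the statistic and its degrees of freedom rather than computing a p-value.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Two-sample t statistic under the equal-variance assumption.

    Returns (t, degrees_of_freedom). Hypothetical helper for illustration.
    """
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / standard_error, df


# Made-up samples: same spread, means differ by 1
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
t_stat, df = pooled_t_statistic(group_a, group_b)
print(t_stat, df)
```

With these samples both variances equal 2.5, the standard error works out to 1, and the statistic is -1 on 8 degrees of freedom. When the variances clearly differ, pooling is no longer justified, and a variant such as Welch's t-test is commonly used instead.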
Question 85: What is the primary purpose of correlation analysis?
A. To establish causality
B. To measure the strength and direction of a linear relationship between two variables
C. To create models
D. To calculate data encryption keys
Answer: B
Explanation: Correlation analysis quantifies the degree to which two variables move in relation to each other, without implying causation.

Question 86: Which correlation coefficient is best for measuring monotonic relationships?
A. Pearson’s
B. Spearman’s
C. Chi-square
D. Standard deviation
Answer: B
Explanation: Spearman’s rank correlation coefficient is ideal for measuring monotonic relationships, especially when the data is not normally distributed.

Question 87: In linear regression, what does the slope coefficient represent?
A. The intercept value
B. The change in the dependent variable for a one-unit change in the independent variable
C. The sample size
D. The model’s accuracy percentage
Answer: B
Explanation: The slope in a regression equation quantifies how much the dependent variable is expected to increase or decrease as the independent variable increases by one unit.

Question 88: Which machine learning algorithm is best described as a “lazy learner”?
A. Decision Tree
B. k-NN
C. Support Vector Machine
D. Random Forest
Answer: B
Explanation: k-NN is known as a lazy learner because it does not build an explicit model until it is required to make a prediction.

Question 89: What is the key characteristic of decision trees?
A. They use linear equations only
B. They split data into branches based on feature values
C. They always produce a continuous output
D. They require neural networks
Answer: B
Explanation: Decision trees split the dataset into branches based on feature thresholds to arrive at a decision or classification.

Question 90: Which evaluation metric is most sensitive to class imbalances?
A. Accuracy
C. Writing a new programming language
D. Converting categorical data to text
Answer: B
Explanation: Hyperparameter tuning involves adjusting settings (like learning rate, depth, etc.) that control how a model is trained.

Question 96: Which library in Python is most commonly used for data visualization?
A. Scikit-learn
B. Matplotlib
C. TensorFlow
D. NumPy
Answer: B
Explanation: Matplotlib is a widely used Python library for creating static, interactive, and animated visualizations.

Question 97: Which of the following is a primary benefit of using cloud platforms in data science?
A. They reduce the need for data encryption
B. They offer scalable computing resources for big data processing
C. They eliminate the need for data preparation
D. They replace all programming languages
Answer: B
Explanation: Cloud platforms like AWS, Azure, and Google Cloud provide scalable resources, enabling efficient handling of large data volumes.

Question 98: In a collaborative data science environment, which tool is widely used for version control?
A. GitHub
B. Excel
C. PowerPoint
D. Notepad
Answer: A
Explanation: GitHub is a popular platform for version control and collaboration using Git.

Question 99: When deploying a model, what is the primary difference between batch and real-time deployment?
A. Batch deployment processes data in groups; real-time processes data instantly as it arrives
B. They are identical
C. Real-time deployment is used only for images
D. Batch deployment requires no monitoring
Answer: A
Explanation: Batch deployment processes data at scheduled intervals, while real-time deployment handles data instantly as it is received.

Question 100: What is model retraining and why is it important?
A. Rewriting the code from scratch
B. Updating the model with new data to maintain its accuracy
C. Increasing the dataset size unnecessarily
D. Reducing the number of features
Answer: B
Explanation: Model retraining involves periodically updating the model with fresh data to ensure its performance remains robust over time.

Question 101: Which tool is used for building interactive dashboards that communicate results?
A. Power BI
B. MATLAB
C. Notepad++
D. Visual Studio Code
Answer: A
Explanation: Power BI is widely used to create interactive dashboards that effectively communicate complex data insights.

Question 102: What does the term “explainability” in model interpretation refer to?
A. The speed at which a model runs
B. The ability to understand and interpret how a model makes decisions
C. The size of the dataset
D. The complexity of the code
Answer: B
Explanation: Explainability refers to how well the inner workings of a model can be understood by humans, particularly for decision-making.

Question 103: Which of the following is a common method to present data science findings to non-technical stakeholders?
A. Detailed code listings
B. Interactive dashboards and visualizations
C. Raw data tables
D. Command line logs
Answer: B
Explanation: Visualizations and dashboards make complex data more accessible and understandable for non-technical audiences.

Question 104: What is a primary ethical concern when building machine learning models?
A. Ensuring models are developed using the latest programming language
B. Preventing algorithmic bias and ensuring fairness
C. Increasing the model’s size
D. Maximizing computational cost
Answer: B
Explanation: Ethical concerns include avoiding algorithmic bias and ensuring that models are fair and do not propagate discrimination.

Question 105: Which practice is essential for protecting data privacy during data collection?
A. Publicly sharing all raw data
B. Anonymizing personally identifiable information (PII)
C. Storing data without encryption
D. Ignoring consent requirements