


















































The PrepIQ AIDSAI Certified Data Mining CDMP Ultimate Exam focuses on extracting meaningful insights from large datasets using classification, clustering, association analysis, predictive modeling, pattern recognition, and advanced data mining methodologies for business intelligence applications.
Typology: Exams



















































Question 1. Which of the following best describes the primary purpose of a Data Governance Office (DGO)?
A) To develop machine-learning models for production use
B) To define policies, roles, and responsibilities for data usage across the organization
C) To manage hardware provisioning for data warehouses
D) To perform day-to-day data entry tasks
Answer: B
Explanation: The DGO establishes governance frameworks, sets policies, and assigns accountability for data assets, ensuring consistent, lawful, and ethical data handling.

Question 2. Under the GDPR, which of the following is a legal basis that allows an organization to process personal data without explicit consent?
A) Legitimate interest, provided the processing does not override the data subject’s rights
B) Data minimization principle
C) Right to be forgotten
D) Data portability requirement
Answer: A
Explanation: GDPR permits processing based on legitimate interests when the organization’s need does not unduly affect the individual’s rights and freedoms. The other options are principles or data subject rights, not legal bases for processing.

Question 3. In the data management lifecycle, which phase is primarily concerned with converting raw data into a format suitable for analysis?
A) Data creation
B) Data archiving
C) Data transformation
D) Data disposal
Answer: C
Explanation: Data transformation (often part of ETL) reshapes, cleans, and formats raw data, making it ready for analytical consumption.

Question 4. Which of the following is a key characteristic of a “golden record” in master data management?
A) It is a temporary snapshot used only during data migration
B) It is the single, authoritative version of a data entity across the enterprise
C) It is a backup copy stored in a cold archive
D) It is a synthetic dataset generated for testing purposes
Answer: B
Explanation: A golden record consolidates duplicate records into one definitive, trusted version used throughout the organization.

Question 5. Normalization in relational database design primarily aims to:
A) Reduce data redundancy and improve update consistency
B) Increase query speed for analytical workloads
C) Simplify data warehouse star schemas
D) Enable storage of unstructured data
Answer: A
Explanation: Normalization organizes tables to minimize duplication and ensure data integrity, which is essential for OLTP systems.

Question 6. Which modeling approach is most suitable for representing hierarchical relationships such as product categories and sub-categories?
A) Snowflake schema
Question 9. The CAP theorem states that a distributed system can simultaneously provide at most two of three properties: consistency, availability, and partition tolerance. Which pair is typically prioritized by NoSQL databases optimized for high availability?
A) Consistency and Partition tolerance
B) Availability and Consistency
C) Availability and Partition tolerance
D) None of the above – all three are always achievable
Answer: C
Explanation: Many NoSQL systems prioritize availability and partition tolerance, sacrificing strict consistency during network partitions.

Question 10. Which of the following is a characteristic of OLAP systems?
A) They are optimized for high-velocity transaction processing
B) They store data in a normalized relational format to minimize redundancy
C) They support multi-dimensional queries and complex aggregations
D) They enforce ACID properties for each individual row operation
Answer: C
Explanation: OLAP (Online Analytical Processing) is designed for analytical queries that involve slicing, dicing, and aggregating across dimensions.

Question 11. In Hadoop’s MapReduce paradigm, what is the role of the “shuffle” phase?
A) To compress output files for storage efficiency
B) To sort and transfer intermediate key-value pairs from mappers to reducers
C) To execute the final aggregation logic on the master node
D) To validate input data against a schema before mapping
Answer: B
Explanation: The shuffle step sorts and moves mapper output to the appropriate reducer based on keys.

Question 12. Which of the following best defines “data virtualization”?
A) Replicating data across multiple physical servers for redundancy
B) Providing a unified logical view of data without moving or copying it physically
C) Transforming structured data into a semi-structured JSON format
D) Encrypting data at rest in a data warehouse
Answer: B
Explanation: Data virtualization abstracts underlying data sources, allowing users to query them as if they were a single entity.

Question 13. An enterprise adopts a “hub-and-spoke” integration pattern. Which statement accurately reflects this architecture?
A) Each system communicates directly with every other system via point-to-point adapters
B) All data flows through a central hub that performs routing, transformation, and orchestration
C) The pattern eliminates the need for any middleware or messaging layer
D) It is synonymous with a monolithic application design
Answer: B
Explanation: In a hub-and-spoke model, a central hub mediates data exchange, simplifying connectivity and governance.

Question 14. Which API design principle promotes “loose coupling” between a data mining service and its consumers?
A) Synchronous blocking calls
B) Tight schema binding in request payloads
B) Trace the origin, transformations, and destinations of data elements for impact analysis and compliance
C) Increase query performance by indexing lineage tables
D) Encrypt data at the point of entry only
Answer: B
Explanation: Lineage provides transparency on how data moves and changes, supporting auditing, debugging, and regulatory reporting.

Question 18. Which of the following is NOT a typical dimension of data quality?
A) Accuracy
B) Completeness
C) Scalability
D) Timeliness
Answer: C
Explanation: Scalability refers to system performance, whereas data quality dimensions focus on the correctness and usefulness of data.

Question 19. During data profiling, which statistical measure helps identify the uniqueness of values in a column?
A) Mean
B) Standard deviation
C) Cardinality (distinct count)
D) Median
Answer: C
Explanation: Cardinality counts distinct values, indicating how many unique entries a column contains, useful for key detection.
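To make the cardinality measure from Question 19 concrete, the following is a minimal Python sketch. The helper name `profile_column` and the sample values are hypothetical and chosen only for illustration; the sketch assumes missing values are represented as `None` and are excluded from the distinct count.

```python
def profile_column(values):
    """Return basic uniqueness statistics for one column (hypothetical helper)."""
    # Assumption: None marks a missing value and is not counted as a distinct entry.
    non_null = [v for v in values if v is not None]
    distinct = len(set(non_null))
    return {
        "rows": len(values),
        "non_null": len(non_null),
        "cardinality": distinct,
        # A column can serve as a key only if every row holds a different, non-missing value.
        "is_candidate_key": len(values) > 0 and distinct == len(values),
    }


customer_ids = [101, 102, 103, 104]
countries = ["DE", "FR", "DE", None]

print(profile_column(customer_ids))
print(profile_column(countries))
```

A column whose cardinality equals its row count is a plausible key candidate, while a low-cardinality column such as `countries` behaves more like a category or dimension attribute.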
Question 20. Role-Based Access Control (RBAC) differs from attribute-based access control (ABAC) primarily in that RBAC:
A) Grants permissions based on user attributes such as department or clearance level
B) Assigns permissions to roles, and users acquire those permissions by being assigned to roles
C) Uses machine learning to infer access decisions dynamically
D) Is only applicable to cloud environments
Answer: B
Explanation: RBAC groups permissions into roles; users inherit access rights through role membership, simplifying management.

Question 21. Which cryptographic technique is most appropriate for protecting data at rest in a data lake?
A) TLS/SSL
B) Asymmetric RSA encryption only for data in transit
C) Transparent Data Encryption (TDE) using symmetric keys
D) Hashing without a salt
Answer: C
Explanation: TDE encrypts stored data transparently with symmetric keys, making it suitable for large volumes at rest.

Question 22. In predictive analytics, a model that outputs a probability of class membership rather than a hard label is known as:
A) A deterministic classifier
B) A probabilistic classifier
C) A clustering algorithm
D) A dimensionality reduction technique
C) Agglomerative hierarchical clustering
D) Gaussian Mixture Models
Answer: C
Explanation: Agglomerative clustering merges individual points into larger clusters, forming a dendrogram that represents nested groupings.

Question 26. In a decision-tree classifier, the Gini impurity measure is used to:
A) Determine the optimal number of leaf nodes after pruning
B) Quantify the homogeneity of a node; lower values indicate purer splits
C) Compute the probability of overfitting the training data
D) Convert continuous variables into categorical bins automatically
Answer: B
Explanation: Gini impurity assesses how mixed the classes are within a node; a split that reduces impurity is preferred.

Question 27. Which regression technique is most suitable when the dependent variable is binary?
A) Linear regression
B) Ridge regression
C) Logistic regression
D) Poisson regression
Answer: C
Explanation: Logistic regression models the log-odds of a binary outcome, producing probabilities bounded between 0 and 1.

Question 28. Cross-validation helps to:
A) Increase the size of the training dataset by duplication
B) Reduce model bias by adding more features automatically
C) Estimate a model’s generalization performance by partitioning data into training and validation folds
D) Convert unsupervised learning problems into supervised ones
Answer: C
Explanation: Cross-validation repeatedly trains and tests on different data splits, providing a robust performance estimate.

Question 29. Which of the following statements about “overfitting” is correct?
A) Overfitting occurs when a model is too simple to capture the underlying pattern
B) Overfitting improves performance on unseen data
C) Overfitting can be mitigated by regularization, pruning, or reducing model complexity
D) Overfitting is only a concern for clustering algorithms
Answer: C
Explanation: Overfitting means the model captures noise in training data; techniques like regularization and pruning help generalize better.

Question 30. In the context of data mining, a “confusion matrix” provides information about:
A) The correlation between two continuous variables
B) The frequency of each class in the training set
C) The counts of true positives, false positives, true negatives, and false negatives for a classifier
D) The distance between cluster centroids
Answer: C
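As an illustration of the four counts named in Question 30, here is a minimal Python sketch for a binary classifier. The function `confusion_counts`, the `positive` parameter, and the label vectors are all hypothetical; this is one straightforward way to tally the counts, not a reference implementation.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary classifier (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted positive: correct if the actual label is also positive.
            counts["TP" if actual == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the actual label was positive.
            counts["FN" if actual == positive else "TN"] += 1
    return counts


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
precision = counts["TP"] / (counts["TP"] + counts["FP"])
recall = counts["TP"] / (counts["TP"] + counts["FN"])
print(counts, precision, recall)
```

Metrics such as precision and recall are derived directly from these four counts, which is why the confusion matrix is usually the first artifact inspected when evaluating a classifier.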
C) Preserve historical attribute values by inserting a new row with versioning and effective dates
D) Store dimension data in a separate NoSQL database
Answer: C
Explanation: SCD Type 2 maintains a full history of changes by creating new rows, allowing analysts to track attribute evolution over time.

Question 34. Which Spark component is responsible for executing SQL queries on structured data?
A) Spark Streaming
B) Spark Core
C) Spark SQL (Catalyst optimizer)
D) GraphX
Answer: C
Explanation: Spark SQL provides a DataFrame API and a SQL interface, leveraging the Catalyst optimizer for query planning.

Question 35. In a NoSQL document store, which feature most directly supports schema flexibility?
A) Fixed column definitions at table creation
B) Ability to store JSON-like documents where each record can have a different set of fields
C) Enforced foreign key constraints
D) Mandatory primary key for every document
Answer: B
Explanation: Document stores (e.g., MongoDB) allow each document to contain a unique structure, enabling schema-on-read flexibility.
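The schema flexibility described in Question 35 can be sketched without a database at all. The example below uses plain Python dictionaries parsed from JSON as a stand-in for a document collection; no real document-store API is used, and the documents, `field_names`, and `read_field` are hypothetical names introduced only for illustration.

```python
import json

# Two hypothetical product documents with different sets of fields.
raw_documents = [
    '{"_id": 1, "name": "Laptop", "specs": {"ram_gb": 16}}',
    '{"_id": 2, "name": "T-shirt", "size": "M", "colors": ["red", "blue"]}',
]
documents = [json.loads(doc) for doc in raw_documents]


def field_names(docs):
    """Collect the union of top-level field names seen across all documents."""
    names = set()
    for doc in docs:
        names.update(doc.keys())
    return sorted(names)


def read_field(doc, field, default=None):
    """Schema-on-read: interpret a field at query time, tolerating its absence."""
    return doc.get(field, default)


print(field_names(documents))
print([read_field(doc, "size", "n/a") for doc in documents])
```

Because no structure is enforced at write time, the reading code carries the burden of handling missing or differently shaped fields, which is the trade-off behind schema-on-read.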
Question 36. Which of the following best defines “data masking”?
A) Encrypting data at rest using a secret key
B) Replacing sensitive data elements with fictional but realistic values for non-production use
C) Deleting all personally identifiable information from a dataset
D) Compressing data to reduce storage space
Answer: B
Explanation: Data masking substitutes real sensitive values with obfuscated equivalents, preserving format while protecting confidentiality.

Question 37. A “data steward” is primarily responsible for:
A) Writing production-grade machine-learning code
B) Managing the physical servers that host the data warehouse
C) Ensuring data quality, definitions, and compliance within a specific domain
D) Designing network topology for data replication
Answer: C
Explanation: Data stewards oversee data assets, enforce standards, and act as custodians for data quality and governance.

Question 38. Which of the following is an example of a “soft delete” strategy in database design?
A) Physically removing a row from the table using DELETE
B) Archiving the record to a separate backup table and purging from the main table
C) Adding a Boolean “IsActive” flag to indicate logical deletion while keeping the row
D) Encrypting the row’s primary key to hide it from queries
Answer: C
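The soft-delete strategy in Question 38 can be sketched with Python's built-in `sqlite3` module. The table name `customers`, the column `is_active` (standing in for the “IsActive” flag), and the sample rows are assumptions made for this example; SQLite has no native Boolean type, so the flag is stored as an integer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, "
    "name TEXT NOT NULL, "
    "is_active INTEGER NOT NULL DEFAULT 1)"  # 1 = active, 0 = logically deleted
)
conn.executemany("INSERT INTO customers (name) VALUES (?)", [("Ada",), ("Grace",)])


def soft_delete(connection, customer_id):
    """Mark the row as logically deleted instead of issuing DELETE."""
    connection.execute(
        "UPDATE customers SET is_active = 0 WHERE id = ?", (customer_id,)
    )


def active_customers(connection):
    """Return names of rows that have not been logically deleted."""
    rows = connection.execute(
        "SELECT name FROM customers WHERE is_active = 1 ORDER BY id"
    )
    return [name for (name,) in rows]


soft_delete(conn, 1)
print(active_customers(conn))

# The soft-deleted row is still physically present in the table.
total_rows = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(total_rows)
```

The row stays available for auditing or restoration, at the cost that every query meant to see only live data must filter on the flag.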
B) Star schemas have a single fact table surrounded by denormalized dimensions, while snowflake schemas further normalize dimensions into multiple related tables
C) Snowflake schemas are only used in NoSQL environments
D) Star schemas require a graph database to implement
Answer: B
Explanation: Star schemas favor simplicity and query performance with flat dimension tables; snowflake schemas break dimensions into hierarchical tables for storage efficiency.

Question 42. Which of the following is a common pitfall when using K-Means clustering on high-dimensional data?
A) The algorithm automatically determines the optimal number of clusters
B) Distance metrics become less meaningful, leading to poor cluster separation (curse of dimensionality)
C) K-Means guarantees globally optimal clusters regardless of initialization
D) K-Means can handle categorical variables without preprocessing
Answer: B
Explanation: In high dimensions, Euclidean distances lose discriminative power, causing K-Means to produce unreliable clusters.

Question 43. Which of the following best describes “feature engineering” in a data mining workflow?
A) Automatically selecting the best model hyperparameters using grid search
B) Creating, transforming, or selecting variables that improve model performance
C) Deploying the final model to a production environment
D) Visualizing model predictions on a dashboard
Answer: B
Explanation: Feature engineering involves crafting meaningful inputs (e.g., scaling, encoding, interaction terms) to enhance predictive power.

Question 44. In the context of model deployment, “A/B testing” is used to:
A) Compare two versions of a model on live traffic to evaluate performance differences
B) Validate data quality before training a model
C) Split the dataset into training and testing sets
D) Perform cross-validation on a single model version
Answer: A
Explanation: A/B testing serves as an online experiment where users are randomly exposed to variant models, measuring real-world impact.

Question 45. Which of the following is NOT a typical component of a data mining project charter?
A) Business objectives and success criteria
B) Detailed source code of the chosen algorithm
C) Scope, timeline, and stakeholder responsibilities
D) Data sources and expected deliverables
Answer: B
Explanation: The charter outlines goals, scope, and resources; the actual algorithm code belongs in technical documentation, not the charter.

Question 46. The “right to be forgotten” under GDPR primarily mandates:
A) Immediate deletion of all data upon request, regardless of legal obligations
B) Erasure of personal data when it is no longer necessary for the purpose it was collected, unless other legal bases apply
Question 49. Which of the following statements about “data anonymization” is correct?
A) Anonymization guarantees that data can never be re-identified, regardless of external data sources
B) It removes or irreversibly alters personally identifiable information so that individuals are no longer identifiable, while aiming to preserve analytical utility
C) It is the same as encryption because the original data can be recovered with a key
D) It is only required for healthcare data, not for other industries
Answer: B
Explanation: Anonymization removes or irreversibly alters identifiers to protect privacy while aiming to retain data usefulness; absolute re-identification resistance is difficult to guarantee. Replacing identifiers with pseudonyms that can still be linked back to a person is pseudonymization, which GDPR continues to treat as personal data.

Question 50. When evaluating a regression model, the “Mean Absolute Error (MAE)” is preferred over “Root Mean Squared Error (RMSE)” when:
A) Outliers are a major concern and you want a metric less sensitive to them
B) You need a metric that penalizes larger errors more heavily
C) The target variable is categorical
D) You are comparing models with different units of measurement
Answer: A
Explanation: MAE treats all errors linearly, making it less affected by extreme outliers compared to RMSE, which squares errors.

Question 51. Which of the following best describes a “data mart”?
A) A full-scale enterprise data warehouse that stores all corporate data
B) A specialized subset of a data warehouse focused on a specific business line or department
C) A real-time streaming platform for sensor data
D) A backup repository for disaster recovery only
Answer: B
Explanation: Data marts are smaller, subject-oriented data stores that draw from the central warehouse to serve departmental needs.

Question 52. In a governance framework, “data stewardship” is distinct from “data ownership” because:
A) Ownership confers legal liability, while stewardship focuses on day-to-day data quality and usage
B) Owners write SQL queries, stewards design hardware
C) Ownership is only relevant for external data vendors
D) Stewards are responsible for encrypting all data at rest
Answer: A
Explanation: Data owners hold ultimate authority and accountability, whereas stewards manage the operational aspects of data quality and compliance.

Question 53. Which of the following is a key advantage of using columnar storage for analytical workloads?
A) Faster transactional inserts and updates
B) Improved compression and scan performance for queries that access a subset of columns
C) Simplified schema design for unstructured data
D) Automatic handling of graph relationships
Answer: B
Explanation: Columnar formats store each column contiguously, enabling high compression ratios and efficient reads of selected columns.

Question 54. In a machine-learning pipeline, “model drift” refers to: