Data Engineering Master Practice Exam, Exams of Technology

This advanced exam targets professionals with a deeper understanding of data engineering. Topics include data architecture, cloud data storage, big data technologies, data governance, and advanced data processing techniques.

Typology: Exams

2025/2026

Available from 01/16/2026

shilpi-jain-1
Data Engineering Master Practice Exam
**Question 1. Which architecture separates storage and compute to allow independent scaling of each
component?**
A) Traditional Data Warehouse
B) Data Lakehouse
C) Lambda Architecture
D) Data Mesh
Answer: B
Explanation: A Data Lakehouse decouples storage (often object storage) from compute (e.g., Spark,
Presto), enabling independent scaling.
**Question 2. In a Data Mesh, what is the primary responsibility of a domain team?**
A) Managing a central data warehouse
B) Owning and serving data as a product
C) Writing ETL scripts for all domains
D) Enforcing global security policies only
Answer: B
Explanation: Data Mesh promotes domain-oriented ownership where each team treats its data as a
product for other consumers.
**Question 3. Which of the following best describes the key difference between Lambda and Kappa
architectures?**
A) Lambda uses only batch processing, Kappa only streaming
B) Lambda maintains separate batch and speed layers, Kappa eliminates the batch layer
C) Kappa requires a data lake, Lambda does not
D) Lambda is only for real-time analytics, Kappa for reporting
Answer: B


Explanation: Lambda keeps both batch and real-time (speed) layers; Kappa simplifies by using a single streaming layer for both.
**Question 4. Which cloud storage service is an object store that provides virtually unlimited scalability and is commonly used as a data lake foundation?**
A) Amazon Redshift
B) Azure Synapse
C) Google Cloud Storage
D) Amazon RDS
Answer: C
Explanation: Google Cloud Storage (like S3 or Azure Blob) is an object store ideal for raw data lake storage.
**Question 5. Which IaC tool uses a declarative JSON/YAML language and is native to AWS?**
A) Terraform
B) Ansible
C) CloudFormation
D) Pulumi
Answer: C
Explanation: AWS CloudFormation uses JSON/YAML templates to provision AWS resources declaratively.
**Question 6. Which cost-optimization technique reduces storage expense by moving infrequently accessed data to a cheaper tier automatically?**
A) Auto-scaling compute clusters
B) Spot instances for Spark executors
C) Lifecycle policies for object storage

A) AWS Glue Data Catalog
B) Amazon DynamoDB
C) AWS Lake Formation
D) Amazon Athena
Answer: A
Explanation: AWS Glue Data Catalog stores Hive metastore information and provides a unified metadata layer.
**Question 10. In relational modeling, what is the primary purpose of normalization to 3NF?**
A) Reduce read latency
B) Eliminate data redundancy and update anomalies
C) Optimize for analytical queries
D) Enable horizontal sharding automatically
Answer: B
Explanation: 3NF removes transitive dependencies, minimizing redundancy and preventing update anomalies.
**Question 11. When would denormalization be most appropriate in a data warehouse?**
A) To ensure ACID compliance for OLTP workloads
B) To reduce join complexity and improve query performance for reporting
C) To simplify ETL logic for CDC
D) To enforce referential integrity across micro-services
Answer: B
Explanation: Denormalized star schemas reduce joins, speeding up analytical queries.
**Question 12. Which index type stores the data rows in the same order as the index key?**

A) Non-clustered index
B) Composite index
C) Clustered index
D) Bitmap index
Answer: C
Explanation: A clustered index orders the table data physically according to the index key.
**Question 13. What is the primary benefit of sharding a relational database?**
A) Guarantees strong consistency across all shards
B) Enables horizontal scaling by distributing rows across multiple servers
C) Removes the need for a primary key
D) Allows automatic schema changes without downtime
Answer: B
Explanation: Sharding partitions data horizontally, allowing the system to scale out by adding more nodes.
**Question 14. In a star schema, which table contains the business metrics to be aggregated?**
A) Dimension table
B) Fact table
C) Bridge table
D) Lookup table
Answer: B
Explanation: Fact tables store quantitative data (sales, clicks) that are aggregated in analytical queries.
**Question 15. Which type of fact table captures a snapshot of the system at regular intervals?**

**Question 18. Which graph database query language is widely used for pattern matching?**
A) SQL
B) Gremlin
C) Cypher
D) SPARQL
Answer: C
Explanation: Cypher, used by Neo4j, expresses graph patterns in a declarative way.
**Question 19. In the CAP theorem, which property does a BASE system typically sacrifice?**
A) Availability
B) Partition tolerance
C) Consistency
D) Durability
Answer: C
Explanation: BASE (Basically Available, Soft state, Eventual consistency) relaxes strict consistency for higher availability.
**Question 20. Which of the following best describes ELT in modern cloud data pipelines?**
A) Transform data before loading into the warehouse
B) Load raw data into the warehouse and then transform using the warehouse’s compute engine
C) Perform transformations on a separate ETL server before loading
D) Use only batch processing for all steps
Answer: B
Explanation: ELT leverages the scalability of cloud warehouses to transform data after loading.
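The load-then-transform order described for ELT in Question 20 can be sketched in a few lines. This is only an illustration: an in-memory SQLite database stands in for a cloud warehouse, and the table names (`raw_orders`, `orders_by_country`) and sample rows are hypothetical.

```python
import sqlite3


def run_elt(rows):
    """Load raw rows untouched, then transform them inside the database engine."""
    conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse (assumption)
    # Load: land the data as-is, with every column kept as text.
    conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)
    # Transform: clean, cast, and aggregate using the engine's own SQL.
    conn.execute(
        """
        CREATE TABLE orders_by_country AS
        SELECT UPPER(TRIM(country)) AS country,
               SUM(CAST(amount AS REAL)) AS total_amount
        FROM raw_orders
        WHERE amount <> ''
        GROUP BY UPPER(TRIM(country))
        """
    )
    result = conn.execute(
        "SELECT country, total_amount FROM orders_by_country ORDER BY country"
    ).fetchall()
    conn.close()
    return result


# Hypothetical raw extract with messy country codes and one empty amount.
raw = [("1", "10.5", " us"), ("2", "4.5", "US "), ("3", "", "de"), ("4", "7", "DE")]
totals = run_elt(raw)
```

The raw table keeps the source data unchanged, so the transformation can be revised and re-run later without extracting the data again.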

**Question 21. Which CDC technique reads the transaction log of a source database to capture changes?**
A) Trigger-based CDC
B) Timestamp column polling
C) Log-based CDC
D) Full table snapshot
Answer: C
Explanation: Log-based CDC parses the database’s write-ahead log to obtain inserts, updates, and deletes.
**Question 22. Which data quality check would be most appropriate to detect duplicate primary keys before loading into a fact table?**
A) Null-value percentage
B) Uniqueness constraint validation
C) Data type conformity
D) Range check for numeric fields
Answer: B
Explanation: Uniqueness validation ensures that primary key values are not duplicated.
**Question 23. In Spark, what is the role of the Driver program?**
A) Executes tasks on worker nodes
B) Stores data in memory across the cluster
C) Coordinates job scheduling, task distribution, and maintains SparkContext
D) Manages HDFS block placement
Answer: C
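As a rough illustration of the uniqueness validation from Question 22, the sketch below counts key values in a batch before it is loaded. The function name `find_duplicate_keys` and the sample records are hypothetical; a production pipeline would more likely run this check in SQL or a data quality framework.

```python
from collections import Counter


def find_duplicate_keys(rows, key):
    """Return the sorted key values that appear more than once in the batch."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical batch destined for a fact table; order_id 2 appears twice.
batch = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": 5.0},
    {"order_id": 2, "amount": 5.0},
    {"order_id": 3, "amount": 7.5},
]
duplicates = find_duplicate_keys(batch, "order_id")
```

A non-empty result would typically fail the load step or route the offending rows to a quarantine table.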

Answer: B
Explanation: The NameNode holds the namespace, file-to-block mapping, and metadata.
**Question 27. Which orchestration tool uses Directed Acyclic Graphs (DAGs) defined in Python scripts?**
A. Prefect
B. Dagster
C. Apache Airflow
D. Luigi
Answer: C
Explanation: Airflow pipelines are expressed as Python DAG objects.
**Question 28. What is the primary purpose of making a task idempotent in a data pipeline?**
A. Reduce execution time
B. Ensure the task can be retried safely without side effects
C. Enable parallel execution
D. Simplify logging
Answer: B
Explanation: Idempotent tasks produce the same result when run multiple times, allowing safe retries.
**Question 29. Which streaming model processes data in micro-batches while providing near-real-time results?**
A. Pure stream processing
B. Lambda Architecture batch layer
C. Structured Streaming (Spark)
D. Kappa Architecture with Flink
Answer: C
Explanation: Spark Structured Streaming treats streaming data as a series of micro-batches, delivering low-latency results.
**Question 30. In Kafka, what ensures that messages within a partition are delivered in order?**
A. Consumer groups
B. Topic replication factor
C. Partition ordering guarantee
D. Log compaction
Answer: C
Explanation: Kafka guarantees order only within a single partition; consumers receive records in the order they were appended.
**Question 31. Which Kafka consumer configuration enables exactly-once processing when used with transactional producers?**
A. enable.auto.commit = true
B. isolation.level = read_committed
C. max.poll.records = 500
D. fetch.min.bytes = 1
Answer: B
Explanation: isolation.level=read_committed makes the consumer read only messages from committed transactions, which supports exactly-once semantics when producers write transactionally.
**Question 32. Which window type aggregates events that arrive within a fixed, non-overlapping interval?**
A. Tumbling window

A. Data masking
B. Tokenization
C. Hashing
D. Salting
Answer: B
Explanation: Tokenization substitutes data with a token that can be mapped back via a secure token vault.
**Question 36. Which tool is commonly used to visualize data lineage across Apache Spark jobs?**
A. AWS Glue DataBrew
B. Apache Atlas
C. Amazon QuickSight
D. Databricks Unity Catalog
Answer: B
Explanation: Apache Atlas tracks metadata and lineage for Hadoop ecosystem components, including Spark.
**Question 37. Under GDPR, which principle requires that personal data be kept only as long as necessary for its purpose?**
A. Data minimization
B. Right to be forgotten
C. Storage limitation
D. Purpose limitation
Answer: C
Explanation: Storage limitation mandates deleting or anonymizing data when it is no longer needed.
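The tokenization explanation above (a token that maps back to the original through a vault) can be illustrated with a toy example. The `TokenVault` class and the `tok_` prefix are hypothetical, and the vault here is only an in-memory dictionary; a real vault would be an access-controlled, encrypted service.

```python
import secrets


class TokenVault:
    """Toy in-memory vault mapping random tokens to original values."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        # Reuse the existing token so the same value always maps to one token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The token is random, so it reveals nothing about the original value.
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        # Only callers with access to the vault can recover the original.
        return self._token_to_value[token]


vault = TokenVault()
card_number = "4111-1111-1111-1111"  # hypothetical sensitive value
token = vault.tokenize(card_number)
```

Unlike hashing, the mapping is reversible, but only through the vault lookup rather than through any computation on the token itself.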

**Question 38. Which component of a Feature Store is responsible for serving low-latency features to online inference services?**
A. Offline store
B. Batch processing pipeline
C. Online store (key-value cache)
D. Feature registry
Answer: C
Explanation: The online store provides fast, point-lookup access for real-time model inference.
**Question 39. What is data drift in the context of MLOps?**
A. The gradual increase in model size over time
B. Changes in the statistical properties of input data that differ from training data
C. The latency increase in model serving endpoints
D. The shift from batch to streaming pipelines
Answer: B
Explanation: Data drift refers to a distribution shift in incoming data that can degrade model performance.
**Question 40. Which technique converts categorical variables into a numeric vector suitable for linear models?**
A. One-hot encoding
B. Hashing trick
C. Label encoding
D. Frequency encoding
Answer: A
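To make the one-hot encoding answer to Question 40 concrete, here is a minimal pure-Python sketch. The function name `one_hot_encode` and the color values are illustrative only; in practice a library encoder would normally be used.

```python
def one_hot_encode(values):
    """Map each category to a 0/1 vector with a single 1 in its own position."""
    categories = sorted(set(values))  # fixed, reproducible column order
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1
        vectors.append(vector)
    return categories, vectors


categories, encoded = one_hot_encode(["red", "green", "red", "blue"])
# categories -> ['blue', 'green', 'red']
# encoded    -> [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Each category gets its own column, so a linear model does not infer a false ordering between categories, which is the risk with label encoding.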

D. AUC-ROC
Answer: C
Explanation: Latency monitoring focuses on response time percentiles to capture tail performance.
**Question 44. Which design pattern helps avoid “pipeline hell” by separating transformation logic into reusable components?**
A. Monolithic ETL
B. Lambda functions only
C. Modular DAG nodes (task libraries)
D. Inline SQL scripts
Answer: C
Explanation: Modular DAG nodes promote reuse and simplify maintenance across pipelines.
**Question 45. Which storage format provides built-in support for predicate pushdown, enabling faster query filtering?**
A. CSV
B. JSON
C. Parquet
D. TXT
Answer: C
Explanation: Parquet stores column statistics, allowing query engines to skip irrelevant row groups.
**Question 46. In a multi-region data lake, which strategy minimizes data egress costs while keeping data close to compute?**
A. Replicate all data to every region
B. Use a single global bucket
C. Store data in a regional bucket and enable cross-region replication only for hot data
D. Move compute to the data’s home region using serverless services
Answer: D
Explanation: Running compute in the same region as the data avoids egress charges; selective replication can be added for specific use cases.
**Question 47. Which Terraform feature enables reuse of common infrastructure code across multiple environments?**
A. Provider blocks
B. Variables
C. Modules
D. Workspaces
Answer: C
Explanation: Modules package reusable configurations that can be instantiated with different inputs.
**Question 48. Which Spark optimization re-optimizes query plans at runtime using statistics collected during execution?**
A. Broadcast joins
B. Adaptive Query Execution (AQE)
C. Tungsten off-heap storage
D. Cache-aware scheduling
Answer: B
Explanation: AQE uses runtime statistics to coalesce shuffle partitions, switch join strategies, and split skewed partitions. It does not cache DataFrames automatically; caching must be requested explicitly with cache() or persist().
**Question 49. Which Kafka configuration determines the number of replicas for each partition?**
A. num.partitions

**Question 52. Which data modeling pattern is ideal for representing many-to-many relationships without duplication in a relational warehouse?**
A. Bridge (association) table
B. Snowflake schema
C. Star schema
D. Factless fact table
Answer: A
Explanation: A bridge table stores pairs of foreign keys to model many-to-many links efficiently.
**Question 53. Which Apache Flink feature enables exactly-once state consistency across checkpoints?**
A. Event-time processing
B. Asynchronous I/O
C. RocksDB state backend with checkpointing
D. Window aggregation
Answer: C
Explanation: Flink’s checkpointing with a durable state backend (e.g., RocksDB) provides exactly-once state guarantees.
**Question 54. Which security practice helps prevent accidental exposure of credentials in CI/CD pipelines?**
A. Hard-coding secrets in scripts
B. Storing secrets in environment variables only at runtime
C. Committing encrypted keys to the repository
D. Using a secrets manager with dynamic credentials injection
Answer: D
Explanation: Secrets managers (e.g., AWS Secrets Manager) provide temporary credentials and avoid static secret storage.
**Question 55. Which data governance principle mandates that each data asset has an assigned owner responsible for its quality?**
A. Data stewardship
B. Data lineage
C. Data cataloging
D. Data provenance
Answer: A
Explanation: Data stewardship assigns accountability for data quality, definitions, and lifecycle.
**Question 56. Which of the following is a primary advantage of using a lakehouse over a traditional data lake?**
A. Unlimited schema flexibility without any enforcement
B. Built-in ACID transactions and native SQL querying on raw files
C. Elimination of all data duplication
D. Automatic machine learning model generation
Answer: B
Explanation: Lakehouses combine the low-cost storage of a lake with transactional guarantees and SQL engines.
**Question 57. Which Apache Airflow feature allows tasks to be retried with exponential back-off?**
A. retry_delay
B. max_active_runs
C. depends_on_past