



































































The Data Engineering Professional exam tests knowledge in managing and engineering data infrastructures, including database management, ETL (Extract, Transform, Load) processes, and data pipelines. Candidates will also be assessed on their ability to work with large-scale data systems and ensure data integrity.
Typology: Exams




































































Question 1. Which schema type stores each dimension's attributes in a single, flat table arranged around a central fact table?
A) Snowflake
B) Star
C) Galaxy
D) Normalized
Answer: B
Explanation: A star schema places denormalized dimension tables directly around a central fact table, forming a star-like shape.

Question 2. In normalization, which normal form eliminates transitive dependencies?
A) 1NF
B) 2NF
C) 3NF
D) BCNF
Answer: C
Explanation: Third Normal Form removes non-key attributes that depend on other non-key attributes, thus eliminating transitive dependencies.

Question 3. Which Data Vault component captures the relationship between hubs?
A) Satellite
B) Link
C) Hub
D) Reference Table
Answer: B
Explanation: Links in Data Vault model many-to-many relationships between business keys (hubs).

Question 4. When designing a data lake, the “Raw” zone typically stores data in which format?
A) Parquet
B) CSV
C) Avro
D) JSON
Answer: B
Explanation: Raw zones keep data as received, in whatever format the source delivers; flat files such as CSV are a common case.

Question 5. Which partitioning strategy is best for time-series data to improve query performance?
A) Hash partitioning
B) Range partitioning on date
C) List partitioning
D) No partitioning
Answer: B
Explanation: Range partitioning by date allows pruning of irrelevant partitions when querying time-based ranges.

Question 6. In Slowly Changing Dimensions, Type 2 tracks history by:
A) Overwriting values
B) Adding new columns
C) Adding new rows
D) Deleting old rows
Answer: C
Explanation: Type 2 creates a new row for each version of a dimension record, preserving full history.

Question 7. Which architecture processes data in both batch and real-time layers?
A) Lambda
B) Kappa
C) Lambda-Kappa Hybrid
D) Pure Streaming
Answer: A
Explanation: Lambda architecture combines a batch layer for historical data with a speed layer for real-time updates.

Question 8. Which file format provides columnar storage and efficient compression for analytics?
A) CSV
B) JSON
C) Parquet
D) TXT
Answer: C
Explanation: Parquet stores data column-wise, enabling better compression and faster column-based scans.

Question 9. In distributed computing, which principle ensures that a task can be split into independent subtasks?
A) Fault tolerance
B) Data locality
C) Parallelism
D) Consistency
Answer: C
Explanation: Parallelism allows a job to be divided into concurrent subtasks that run on multiple nodes.
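The Type 2 behavior described in Question 6 can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a warehouse implementation: the `CustomerRow` class, the `valid_from`/`valid_to` columns, and the `apply_scd2` helper are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerRow:
    customer_id: int          # business key
    city: str                 # tracked attribute
    valid_from: date
    valid_to: Optional[date]  # None marks the current version


def apply_scd2(rows: List[CustomerRow], customer_id: int,
               new_city: str, change_date: date) -> None:
    """Expire the current version of a record and append a new one."""
    current = next(
        (r for r in rows if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is not None:
        if current.city == new_city:
            return                      # nothing changed, history stays as-is
        current.valid_to = change_date  # close the old version
    rows.append(CustomerRow(customer_id, new_city, change_date, None))


dim_customer: List[CustomerRow] = [CustomerRow(1, "Lyon", date(2023, 1, 1), None)]
apply_scd2(dim_customer, 1, "Paris", date(2024, 6, 1))
for row in dim_customer:
    print(row)
```

After the update the dimension holds two rows for the same business key: the expired "Lyon" version and the current "Paris" version, which is what distinguishes Type 2 from an overwrite (Type 1).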
Question 15. Time-series databases are typically optimized for:
A) OLTP transactions
B) Large binary objects
C) High-velocity append-only data
D) Relational joins
Answer: C
Explanation: They specialize in ingesting and querying sequential, timestamped data efficiently.

Question 16. In HDFS, the default replication factor is:
A) 1
B) 2
C) 3
D) 4
Answer: C
Explanation: HDFS replicates each block three times to ensure fault tolerance.

Question 17. Which object-storage consistency model provides read-after-write guarantees for new objects?
A) Eventual
B) Strong
C) Read-your-writes
D) Session
Answer: C
Explanation: “Read-your-writes” ensures that a client sees its own writes immediately after they complete.

Question 18. Hive Metastore primarily stores:
A) Raw data
B) Query results
C) Table metadata
D) Access logs
Answer: C
Explanation: The metastore holds schema definitions, partitions, and other metadata for Hive tables.

Question 19. Which ETL approach pushes transformation logic to the target system?
A) Traditional ETL
B) ELT
C) CDC
D) Data Virtualization
Answer: B
Explanation: ELT (Extract, Load, Transform) moves data first, then performs transformations inside the destination (e.g., a data warehouse).

Question 20. In Apache Spark, what is the purpose of the “shuffle” operation?
A) Persist data to disk
B) Repartition data across executors
C) Cache RDDs
D) Broadcast variables
Answer: B
Explanation: Shuffle redistributes data based on key to enable operations like groupBy or join across partitions.

Question 21. Which Spark configuration reduces memory pressure when many shuffles occur?
A) spark.sql.shuffle.partitions
B) spark.serializer
C) spark.executor.memory
D) spark.driver.cores
Answer: A
Explanation: The number of shuffle partitions controls how many pieces shuffle output is split into, and therefore how much intermediate data each task has to handle.

Question 22. In streaming, “event time” refers to:
A) Time when the event is processed
B) Time when the event was generated
C) System clock
D) Watermark timestamp
Answer: B
Explanation: Event time is the timestamp embedded in the data, representing when the event actually occurred.

Question 23. Which Kafka consumer offset management strategy ensures exactly-once processing when combined with idempotent writes?
A) Auto-commit
B) Manual commit after processing
C) No commit
D) Commit on poll
Answer: B
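The reasoning behind Question 23 can be illustrated with a small simulation. Committing offsets only after processing gives at-least-once delivery, so a crash between the write and the commit causes records to be redelivered; an idempotent sink makes that replay harmless. The sketch below uses plain Python structures as hypothetical stand-ins for a topic partition and a sink table, and does not call any Kafka client API.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-ins for a topic partition and a sink table.
log: List[Tuple[int, str]] = [(0, "a"), (1, "b"), (2, "c")]  # (offset, payload)
sink: Dict[int, str] = {}                                      # keyed by offset
committed_offset = 0                                           # next offset to read


def process_batch(crash_before_commit: bool) -> None:
    """Write every uncommitted record, then commit unless we 'crash'."""
    global committed_offset
    for offset, payload in log[committed_offset:]:
        # Idempotent write: replaying the same offset overwrites the same key
        # with the same value instead of creating a duplicate.
        sink[offset] = payload.upper()
    if crash_before_commit:
        return  # offsets are not advanced, so the records will be redelivered
    committed_offset = len(log)


process_batch(crash_before_commit=True)   # first attempt fails after writing
process_batch(crash_before_commit=False)  # retry reprocesses the same records
print(sink, committed_offset)
```

Even though every record is processed twice, the sink ends up with exactly one entry per offset, which is the effect the question describes.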
Question 28. Which CI/CD practice is essential for automated testing of data pipelines before production deployment?
A) Blue-green deployment
B) Canary release
C) Unit test stage
D) Feature toggle
Answer: C
Explanation: Adding a unit-test stage to the pipeline validates transformations and logic automatically.

Question 29. A logging framework that writes structured JSON logs is most useful for:
A) Human readability
B) Reducing storage cost
C) Automated log parsing
D) Encrypting logs
Answer: C
Explanation: Structured JSON enables downstream tools to query and analyze logs programmatically.

Question 30. Which metric is most indicative of a pipeline’s latency?
A) Throughput (records/sec)
B) CPU utilization
C) End-to-end processing time
D) Disk I/O
Answer: C
Explanation: Latency measures the time taken from data arrival to its availability downstream.

Question 31. Data completeness is a dimension of data quality that measures:
A) Accuracy of values
B) Presence of expected records
C) Consistency across sources
D) Timeliness of updates
Answer: B
Explanation: Completeness assesses whether all required data elements or rows are present.

Question 32. Which technique masks sensitive columns while preserving format for analytics?
A) Encryption
B) Tokenization
C) Data vaulting
D) Data masking with deterministic hashing
Answer: D
Explanation: Deterministic hashing replaces each value with a consistent pseudonym, so equal inputs still match in joins and aggregations while the actual data stays hidden. Whether the original format is kept depends on how the masked value is encoded.

Question 33. GDPR’s “right to be forgotten” primarily requires:
A) Data encryption
B) Data minimization
C) Deletion of personal data upon request
D) Anonymization of logs
Answer: C
Explanation: Individuals can request erasure of their personal data, obligating organizations to delete it.

Question 34. In metadata management, a data lineage diagram shows:
A) Physical storage locations
B) Access control lists
C) Flow of data from source to consumer
D) Data quality scores
Answer: C
Explanation: Lineage traces how data moves, transforms, and is used across systems.

Question 35. Which of the following is a characteristic of a column-family NoSQL database?
A) Fixed schema
B) Row-level locking
C) Wide rows with dynamic columns
D) Graph traversal API
Answer: C
Explanation: Column-family stores (e.g., Cassandra) allow rows to have a flexible set of columns, often used for time-series.

Question 36. In a star schema, which table typically contains the most rows?
A) Fact table
B) Dimension table
C) Bridge table
D) Lookup table
Answer: A
Explanation: Fact tables store transactional events and therefore have the highest row count.
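To make the deterministic masking from Question 32 concrete, here is a minimal sketch using a keyed hash (HMAC-SHA-256) from the Python standard library. The key, the `mask` helper, and the 16-character truncation are assumptions made for the example; a production setup would manage the key in a secrets store and choose the token length deliberately.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-change-me"  # hypothetical key, for illustration only


def mask(value: str) -> str:
    """Return a stable pseudonym: the same input always yields the same token."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; shorter tokens collide more easily


emails = ["ana@example.com", "bob@example.com", "ana@example.com"]
masked = [mask(e) for e in emails]
print(masked)
```

The first and third tokens are identical, so the masked column can still be used as a join or grouping key, while the output is a hex string rather than something shaped like an email address.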
Answer: C
Explanation: AWS Glue Data Catalog stores table definitions and works with Athena, EMR, and other services.

Question 42. When using CDC with Debezium, which connector type reads changes from a MySQL binlog?
A) JDBC
B) LogMiner
C) MySQL Connector
D) MongoDB Connector
Answer: C
Explanation: Debezium’s MySQL connector parses the MySQL binary log to capture row changes.

Question 43. Which of the following is a common cause of data skew in Spark joins?
A) Small shuffle partitions
B) Unequal key distribution
C) Excessive caching
D) High executor memory
Answer: B
Explanation: When a few keys dominate, tasks processing those keys become bottlenecks, causing skew.

Question 44. In Azure Data Factory, a “Mapping Data Flow” is primarily used for:
A) Orchestrating pipelines
B) Visual data transformation
C) Managing secrets
D) Monitoring logs
Answer: B
Explanation: Mapping Data Flows provide a UI-based, Spark-backed transformation engine.

Question 45. Which security principle ensures users can only access data necessary for their role?
A) Least privilege
B) Defense in depth
C) Separation of duties
D) Auditing
Answer: A
Explanation: Least privilege restricts permissions to the minimum required for a user’s responsibilities.

Question 46. A data lake’s “Curated” zone typically contains:
A) Raw ingestion files
B) Unstructured logs
C) Cleaned, structured datasets
D) Backup snapshots
Answer: C
Explanation: The curated zone holds transformed, quality-checked data ready for analytics.

Question 47. Which SQL clause is used to remove duplicate rows from a result set?
A) DISTINCT
B) GROUP BY
C) HAVING
D) UNION ALL
Answer: A
Explanation: DISTINCT eliminates duplicate rows in the query output.

Question 48. In a graph database, a “traversal” operation most often uses which algorithm?
A) QuickSort
B) Breadth-First Search
C) Hash Join
D) Merge Sort
Answer: B
Explanation: Traversals explore connected nodes; BFS is a standard method for level-by-level exploration.

Question 49. Which NoSQL database provides built-in support for time-to-live (TTL) on records?
A) MongoDB
B) Redis
C) Cassandra
D) Neo4j
Answer: C
Explanation: Cassandra allows setting a TTL per column, after which data expires automatically. Note that MongoDB (TTL indexes) and Redis (key expiry) also offer expiration mechanisms, so the question is best read as asking about per-column TTL.

Question 50. In a data warehouse, “conformed dimensions” refer to:
A) Dimensions shared across multiple fact tables
B) Dimensions with hierarchical levels
C) Dimensions stored in a separate schema
D) Dimensions encrypted at rest
Answer: A
Explanation: Conformed dimensions have consistent definitions and values across different fact tables, enabling integrated reporting.
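The level-by-level exploration mentioned in Question 48 can be sketched with a plain adjacency list. The graph contents and the `bfs` helper below are made up for illustration and are not tied to any particular graph database API.

```python
from collections import deque
from typing import Dict, List

# Hypothetical "follows" graph stored as an adjacency list.
graph: Dict[str, List[str]] = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}


def bfs(start: str) -> List[str]:
    """Visit nodes in breadth-first order, nearest neighbors first."""
    visited = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.append(neighbor)
                queue.append(neighbor)
    return visited


order = bfs("alice")
print(order)  # ['alice', 'bob', 'carol', 'dave']
```

Both direct neighbors of "alice" are visited before "dave", who is two hops away; that ordering is what makes BFS a natural fit for "friends of friends" style traversals.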
A) Faster row inserts
B) Reduced storage for binary data
C) Efficient column-wise compression
D) Simplified ACID transactions
Answer: C
Explanation: Columnar formats compress similar values together, improving scan speed for column-based queries.

Question 56. Which Spark operation is lazy and does not trigger execution until an action is called?
A) collect()
B) map()
C) reduce()
D) show()
Answer: B
Explanation: Transformations like map are evaluated lazily; actions such as collect trigger execution.

Question 57. In Kafka Streams, the state store is used to:
A) Persist processed records
B) Maintain intermediate aggregation state
C) Store configuration
D) Buffer outbound messages
Answer: B
Explanation: State stores keep the current aggregation or join state that stateful operations need between records.

Question 58. Which data quality rule checks that a numeric column contains only positive values?
A) Uniqueness
B) Range validation
C) Pattern match
D) Referential integrity
Answer: B
Explanation: Range validation ensures values fall within an acceptable interval (e.g., >0).

Question 59. Which AWS service enables serverless execution of Spark jobs?
A) Redshift
B) EMR on EKS
C) Glue Spark Jobs
D) Athena
Answer: C
Explanation: AWS Glue provides managed Spark environments for ETL without provisioning clusters.

Question 60. In a CI/CD pipeline, which stage should contain “dbt test” commands?
A) Build
B) Deploy
C) Test
D) Release
Answer: C
Explanation: dbt tests validate data transformations and are run during the testing stage.

Question 61. Which of the following is a best practice for handling schema evolution in a data lake?
A) Overwrite existing files
B) Use versioned folders
C) Delete old data
D) Store all data in a single CSV
Answer: B
Explanation: Versioned folders (or partitions keyed by version) allow backward compatibility while preserving history.

Question 62. Which SQL clause is used to rename a column in the result set?
A) RENAME
B) AS
C) ALIAS
D) LABEL
Answer: B
Explanation: The AS keyword provides a column alias in the SELECT list.

Question 63. In a Kappa architecture, how are updates to historical data handled?
A) Reprocess entire batch layer
B) Append-only log
C) Update via speed layer only
D) Use CDC to rewrite history
Answer: B
Explanation: Kappa relies on an immutable log (e.g., Kafka) where new events are appended; reprocessing is done by replaying the log.

Question 64. Which Hadoop component provides a distributed, fault-tolerant key-value store?
Question 69. Which SQL function can be used to calculate the rank of rows within a partition?
A) ROW_NUMBER()
B) RANK()
C) DENSE_RANK()
D) All of the above
Answer: D
Explanation: All three are window functions that assign ranking values, differing in handling ties.

Question 70. In a data pipeline, “backfill” refers to:
A) Deleting old data
B) Re-processing historical data
C) Scaling out workers
D) Archiving logs
Answer: B
Explanation: Backfill runs the pipeline on past data to fill missing partitions or correct earlier issues.

Question 71. Which data encryption method protects data at rest in cloud object storage?
A) TLS
B) AES-256
C) MD5
D) SHA-256
Answer: B
Explanation: AES-256 is a symmetric encryption algorithm commonly used for encrypting stored objects.

Question 72. Which of the following is a characteristic of a “cold” data lake zone?
A) Frequently accessed
B) Low latency
C) Archived, infrequently accessed
D) Real-time streaming
Answer: C
Explanation: Cold zones store rarely accessed historical data, often on cheaper storage tiers.

Question 73. In Airflow, what does the “schedule_interval” parameter define?
A) Task retry count
B) DAG execution frequency
C) Number of workers
D) Logging level
Answer: B
Explanation: schedule_interval sets how often the DAG should be triggered (e.g., daily, hourly).
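The tie-handling difference noted in Question 69 is easier to see with numbers. The sketch below reimplements the three ranking behaviors in plain Python over a single partition ordered by score descending; the `scores` data and the `rank_rows` helper are hypothetical and stand in for what `ROW_NUMBER()`, `RANK()`, and `DENSE_RANK()` would return in SQL.

```python
from typing import List, Tuple

scores: List[Tuple[str, int]] = [("ann", 90), ("ben", 85), ("cat", 85), ("dan", 70)]


def rank_rows(rows: List[Tuple[str, int]]) -> List[Tuple[str, int, int, int]]:
    """Return (name, row_number, rank, dense_rank) ordered by score descending."""
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    result = []
    rank = dense = 0
    prev = None
    for i, (name, score) in enumerate(ordered, start=1):
        if score != prev:
            rank = i    # RANK jumps to the row position, leaving gaps after ties
            dense += 1  # DENSE_RANK advances by exactly one per distinct value
            prev = score
        result.append((name, i, rank, dense))
    return result


ranked = rank_rows(scores)
for row in ranked:
    print(row)
```

For the tied scores of 85, the row numbers are 2 and 3, both rows get rank 2 and dense rank 2, and the next row gets rank 4 but dense rank 3.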
Question 74. Which NoSQL database uses “CQL” as its query language?
A) MongoDB
B) Cassandra
C) Redis
D) DynamoDB
Answer: B
Explanation: Cassandra Query Language (CQL) resembles SQL but operates on a column-family store.

Question 75. What is the primary advantage of using a “bridge” table in a star schema?
A) Enforce referential integrity
B) Model many-to-many relationships
C) Store audit logs
D) Reduce storage size
Answer: B
Explanation: Bridge tables resolve many-to-many links between facts and dimensions.

Question 76. Which Spark configuration controls the amount of memory used for broadcast variables?
A) spark.sql.shuffle.partitions
B) spark.broadcast.blockSize
C) spark.memory.fraction
D) spark.serializer
Answer: B
Explanation: spark.broadcast.blockSize sets the size of each block that a broadcast variable is split into when it is distributed to executors. It is not the size threshold that decides whether a table is broadcast in a join.

Question 77. In Kafka, what is the purpose of a “consumer group”?
A) Replicate topics
B) Provide high availability
C) Enable parallel consumption with offset tracking
D) Encrypt messages
Answer: C
Explanation: Consumer groups allow multiple consumers to share the load of a topic while maintaining per-group offsets.

Question 78. Which AWS service is designed for real-time analytics on streaming data?
Explanation: Upserts based on a stable business key ensure repeated writes do not create duplicates.

Question 83. In Azure Synapse, what does “serverless SQL pool” allow?
A) Provisioning dedicated VMs
B) Querying data directly from storage without provisioning compute
C) Running Spark jobs
D) Managing Data Factory pipelines
Answer: B
Explanation: Serverless SQL pool lets you run T-SQL queries over files in ADLS without dedicated clusters.

Question 84. Which of the following best describes “data provenance”?
A) Encryption method
B) Data lineage
C) Data profiling
D) Data masking
Answer: B
Explanation: Provenance tracks the origin and transformations applied to a data element.

Question 85. Which tool is commonly used for visualizing Airflow DAGs?
A) Grafana
B) Airflow UI
C) Kibana
D) Tableau
Answer: B
Explanation: The Airflow web UI provides a graphical view of DAG structures and task status.

Question 86. In a relational database, what does the term “cascading delete” refer to?
A) Deleting rows in batches
B) Automatically deleting child rows when a parent is removed
C) Archiving deleted rows
D) Using triggers to log deletions
Answer: B
Explanation: Cascading delete propagates the delete operation to related foreign-key rows.

Question 87. Which of the following is a key advantage of using a “schema-on-read” approach?
A) Immediate data validation
B) Flexibility to store raw data without upfront schema
C) Faster query execution
D) Enforced data types at ingestion
Answer: B
Explanation: Schema-on-read allows raw data to be stored first; the schema is applied only when data is read.

Question 88. Which Spark transformation can cause a full shuffle of data across the cluster?
A) map()
B) filter()
C) groupByKey()
D) sample()
Answer: C
Explanation: groupByKey requires moving all values for a key to the same executor, triggering a shuffle.

Question 89. In Kafka, what does the “replication factor” determine?
A) Number of partitions
B) Number of brokers
C) Number of copies of each partition
D) Consumer group size
Answer: C
Explanation: Replication factor sets how many broker replicas store each partition for fault tolerance.

Question 90. Which of the following is a typical use case for a “lookup” table in ETL?
A) Storing transaction logs
B) Mapping codes to descriptive values
C) Archiving raw files
D) Managing user sessions
Answer: B
Explanation: Lookup tables translate foreign keys or codes into human-readable descriptions during transformation.

Question 91. Which cloud storage consistency model guarantees that a read after a write will return the latest value?
A) Eventual consistency
B) Strong consistency
C) Causal consistency
D) Session consistency