


















































The Databricks Data Engineering Professional Ultimate Exam is an advanced preparation resource for professionals working with large-scale data pipelines and cloud-based data engineering solutions. This exam emphasizes ETL development, Spark optimization, Delta Lake architecture, streaming data processing, workflow orchestration, data governance, and performance tuning within the Databricks ecosystem. Candidates will develop expertise in designing scalable data solutions, managing distributed computing environments, and implementing efficient data workflows. With challenging practice questions and in-depth explanations, this Ultimate Exam supports certification preparation and advanced career development in modern data engineering.



















































Question 1. Which Databricks feature allows you to define a reusable, version-controlled set of notebooks, libraries, and configurations for a data pipeline?
A) Unity Catalog
B) Databricks Asset Bundles (DABs)
C) Lakeflow
D) Delta Sharing
Answer: B
Explanation: DABs package notebooks, libraries, and settings into a single versioned artifact, enabling reuse across environments.

Question 2. In the Medallion Architecture, what is the primary purpose of the Bronze layer?
A) Store curated, business-ready tables
B) Host aggregated metrics for reporting
C) Ingest raw, immutable source data
D) Provide pre-computed machine-learning features
Answer: C
Explanation: The Bronze layer captures raw data exactly as it arrives, preserving source fidelity for downstream processing.

Question 3. Which Spark join strategy is most efficient when one side of the join is small enough to fit in memory on each executor?
A) Shuffle join
B) Broadcast join
C) Sort-merge join
D) Cartesian join
Answer: B
Explanation: Broadcast joins replicate the small dataset to all executors, avoiding a costly shuffle of the large side.

Question 4. What does the AUTO CDC API in Databricks simplify?
A) Automatic schema inference for JSON files
B) Change Data Capture for Delta tables
C) Auto-scaling of SQL warehouses
D) Generation of data lineage graphs
Answer: B
Explanation: AUTO CDC (formerly APPLY CHANGES) automates CDC processing, merging source changes into Delta tables.

Question 5. Which of the following is a benefit of Liquid Clustering?
A) Guarantees physical ordering of rows on disk
B) Enables multi-dimensional clustering without repartitioning data
C) Replaces Z-ordering for all workloads
D) Provides real-time row-level security
Answer: B
Explanation: Liquid Clustering clusters data on multiple columns and lets you change the clustering keys without rewriting existing data, which improves file pruning.

Question 6. When using Auto Loader, which option ensures schema evolution without job restarts?
A) cloudFiles.schemaLocation
B) cloudFiles.maxFilesPerTrigger
C) cloudFiles.includeExistingFiles
D) cloudFiles.format
Answer: C
Explanation: OPTIMIZE rewrites small files into larger ones and can apply Z-ordering for better pruning.

Question 10. What is the default retention period for Delta table history before it can be vacuumed?
A) 7 days
B) 14 days
C) 30 days
D) 90 days
Answer: A
Explanation: By default, VACUUM only removes data files that have been unreferenced for more than 7 days (delta.deletedFileRetentionDuration). The transaction log itself is kept for 30 days by default (delta.logRetentionDuration).

Question 11. Which Unity Catalog object type is used to control access to an external cloud storage location?
A) Managed Table
B) External Table
C) Volume
D) Function
Answer: C
Explanation: Among these options, a volume is the Unity Catalog object that governs access to files in cloud storage; an external volume points at a path inside an external location.

Question 12. How does row-level security differ from column-level security in Unity Catalog?
A) Row-level filters rows; column-level masks column values
B) Row-level masks; column-level filters rows
C) Both apply the same policy syntax
D) Neither is supported in Unity Catalog
Answer: A
Explanation: Row-level security uses predicates to include or exclude rows, while column-level security can hide or mask specific columns.

Question 13. Which of the following is a valid way to share a Delta table with an external partner without moving data?
A) Export as CSV and email
B) Create a public S3 bucket
C) Use Delta Sharing via a share object
D) Copy the table to a new workspace
Answer: C
Explanation: Delta Sharing provides secure, read-only access to Delta tables across organizations without data duplication.

Question 14. In Spark, what does the mapPartitions transformation do compared to map?
A) Executes a function on each element individually
B) Executes a function on each partition as an iterator, reducing overhead
C) Guarantees ordering of output rows
D) Performs a join across partitions
Answer: B
Explanation: mapPartitions processes an entire partition at once, allowing batched operations and lower call overhead.
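
The difference in Question 14 can be sketched without a cluster. The following is plain Python, not Spark: `partitions`, `map_like`, and `map_partitions_like` are hypothetical stand-ins for an RDD and its two transformations, and the counter models per-call setup cost (for example, opening a connection).

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical stand-in for a dataset split into three partitions.
partitions: List[List[int]] = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Counts how often each style pays its (pretend) setup cost.
setup_calls: Dict[str, int] = {"map": 0, "mapPartitions": 0}


def map_like(func: Callable[[int], int], parts: List[List[int]]) -> List[List[int]]:
    """Mimic map: func is invoked once per element."""
    return [[func(x) for x in part] for part in parts]


def map_partitions_like(
    func: Callable[[Iterator[int]], Iterable[int]], parts: List[List[int]]
) -> List[List[int]]:
    """Mimic mapPartitions: func is invoked once per partition with an iterator."""
    return [list(func(iter(part))) for part in parts]


def double_one(x: int) -> int:
    setup_calls["map"] += 1  # setup paid for every element
    return x * 2


def double_partition(rows: Iterator[int]) -> Iterator[int]:
    setup_calls["mapPartitions"] += 1  # setup paid once per partition
    for x in rows:
        yield x * 2


per_element = map_like(double_one, partitions)
per_partition = map_partitions_like(double_partition, partitions)
```

Both calls produce the same doubled values; only the number of function invocations differs (one per element versus one per partition), which is the overhead the explanation refers to.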
Explanation: Setting the table property delta.enableChangeDataFeed to true activates the Change Data Feed feature.

Question 18. In a streaming job reading from Kafka, which option guarantees exactly-once semantics when writing to a Delta table?
A) outputMode("append") only
B) checkpointLocation and mergeSchema enabled
C) Using foreachBatch with idempotent writes and a checkpoint location
D) Setting spark.sql.streaming.maxFilesPerTrigger to 1
Answer: C
Explanation: foreachBatch on its own gives at-least-once delivery. Pairing the checkpoint with idempotent writes, so that a replayed micro-batch has no additional effect, yields effectively exactly-once results.

Question 19. Which of the following best describes a Slowly Changing Dimension Type 2 implementation in Delta Lake?
A) Overwrite the existing row with the new value
B) Insert a new row with a surrogate key and effective dates, keeping the old row unchanged
C) Delete the old row and insert the new one
D) Store both old and new values in a JSON column
Answer: B
Explanation: SCD Type 2 adds a new version of the record with effective timestamps, preserving historical data.

Question 20. What does the DESCRIBE HISTORY command return for a Delta table?
A) Physical file locations on storage
B) All SQL queries executed on the table
C) A chronological list of commits, including operation type and user
D) Column statistics for the latest version
Answer: C
Explanation: DESCRIBE HISTORY provides the audit trail of commits, showing version, timestamp, operation, and user.

Question 21. Which Spark configuration setting controls the number of shuffle partitions?
A) spark.sql.shuffle.partitions
B) spark.sql.autoBroadcastJoinThreshold
C) spark.sql.files.maxPartitionBytes
D) spark.sql.adaptive.enabled
Answer: A
Explanation: spark.sql.shuffle.partitions sets how many partitions are used when shuffling data for joins and aggregations (default 200). It fixes the partition count, not the partition size.

Question 22. When would you prefer a Serverless SQL warehouse over a provisioned (Classic) warehouse?
A) When you need deterministic instance types for low latency
B) For workloads with highly variable query concurrency and unpredictable demand
C) When you require GPU acceleration
D) For long-running ETL jobs that need dedicated clusters
Answer: B
Explanation: Serverless warehouses automatically scale compute up and down, ideal for spiky, ad-hoc query patterns.

Question 23. Which of the following is NOT a valid method for implementing data lineage in Databricks?
A) Unity Catalog’s automatic lineage capture
A) Drops rows that fail the expectation and logs them
B) Throws an exception and aborts the pipeline
C) Marks rows as “bad” but continues processing
D) Converts the expectation into a view
Answer: A
Explanation: expect_or_drop removes rows that do not satisfy the expectation, ensuring downstream tables contain only valid data.

Question 27. Which of the following is a core difference between Managed and External tables in Unity Catalog?
A) Managed tables store data in Unity Catalog’s managed storage location; External tables reference data at a path outside of that location.
B) External tables support versioning; Managed tables do not.
C) Managed tables can be shared via Delta Sharing; External tables cannot.
D) There is no functional difference; the terms are interchangeable.
Answer: A
Explanation: Unity Catalog manages both the metadata and the data files of Managed tables, while External tables point to data residing elsewhere.

Question 28. How does Bloom Filter indexing improve query performance?
A) It physically sorts data on disk.
B) It creates a probabilistic index that quickly rejects non-matching rows for high-cardinality columns.
C) It merges small files into larger ones.
D) It enables automatic data replication across regions.
Answer: B
Explanation: Bloom filters allow fast existence checks, reducing I/O for selective predicates. A filter can return false positives but never false negatives, so a "no" answer safely rules a file out.
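
To make the idea in Question 28 concrete, here is a minimal Bloom filter in plain Python. It is a sketch of the general data structure, not the Databricks implementation; the class, the file names, and the sizing (`num_bits`, `num_hashes`) are all assumptions chosen for illustration.

```python
import hashlib
from typing import Dict, Iterator, List


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, value: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(value))


# Hypothetical data files, each holding values of a high-cardinality column.
files: Dict[str, List[str]] = {
    "part-0000": ["user-17", "user-42", "user-99"],
    "part-0001": ["user-03", "user-58"],
}

# One filter per file, built when the file is written.
filters: Dict[str, BloomFilter] = {}
for name, values in files.items():
    bf = BloomFilter()
    for v in values:
        bf.add(v)
    filters[name] = bf


def files_to_scan(value: str) -> List[str]:
    """Return only the files whose filter cannot rule the value out."""
    return [name for name, bf in filters.items() if bf.might_contain(value)]
```

A point lookup such as `files_to_scan("user-42")` always includes the file that really holds the value; files whose filter answers "no" are skipped without being opened.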
Question 29. Which of the following statements about Delta Sharing is true?
A) It requires both provider and recipient to be on the same Databricks account.
B) Recipients can write to shared tables.
C) Sharing is governed by share objects that list specific tables, schemas, or entire catalogs.
D) Shared data is copied to the recipient’s storage automatically.
Answer: C
Explanation: Share objects define what is exposed; recipients get read-only access without data duplication.

Question 30. What does the spark.databricks.delta.retentionDurationCheck.enabled configuration control?
A) Whether VACUUM respects the 7-day minimum retention rule
B) Whether Delta tables automatically purge history after 30 days
C) Whether the system checks for orphaned files during OPTIMIZE
D) Whether schema enforcement is applied on writes
Answer: A
Explanation: Setting this to false allows VACUUM to remove files younger than the default 7-day safety window.

Question 31. Which of the following is the most efficient way to enforce a primary key constraint on a Delta table during merges?
A) Use a MERGE statement with a WHEN MATCHED clause checking the key
B) Enable delta.enableDeletionVectors and rely on them for uniqueness
C) Create a unique index using CREATE INDEX (not supported)
D) Use the OPTIMIZE command with ZORDER BY the key column
Answer: A
D) spark.sql.optimizer.dynamicPartitionPruning = false
Answer: A
Explanation: Setting spark.sql.adaptive.enabled to true activates AQE, allowing runtime optimizations like coalescing shuffle partitions.

Question 35. In a streaming query, what does the trigger(once=True) option do?
A) Runs the query continuously with a fixed interval
B) Processes all available data once and then stops
C) Triggers the query only when a new file appears
D) Enables exactly-once semantics for batch jobs
Answer: B
Explanation: trigger(once=True) processes all accumulated data in a single micro-batch and then terminates.

Question 36. Which of the following best describes the purpose of a “quarantine” table in a DLT pipeline?
A) Stores rows that passed all expectations
B) Holds rows that failed expectations for later inspection
C) Contains only metadata about pipeline runs
D) Archives old versions of the table
Answer: B
Explanation: Quarantine tables capture “bad” rows that violate expectations, enabling data quality investigations.

Question 37. What is the effect of setting spark.databricks.delta.merge.repartitionBeforeWrite to true?
A) Forces a shuffle before a MERGE to improve parallelism
B) Disables the MERGE operation entirely
C) Writes MERGE results directly to the original files without copy-on-write
D) Enables automatic schema evolution
Answer: A
Explanation: With this setting, the MERGE output is repartitioned by the table's partition columns before it is written. This adds a shuffle but reduces the number of small files the MERGE produces.

Question 38. Which of the following is true about Change Data Feed (CDF) in Delta Lake?
A) It retains only the last 24 hours of changes.
B) It can be queried using TABLE_CHANGES function.
C) It requires enabling delta.enableChangeDataFeed at table creation.
D) It automatically deletes old change files after 7 days.
Answer: B
Explanation: In SQL, the change feed is read with the table_changes function, for example SELECT * FROM table_changes('table_name', 0). The table property delta.enableChangeDataFeed must be set to true, but it can also be enabled on an existing table with ALTER TABLE, so option C is too strict.

Question 39. When would you choose a Range join over a Broadcast join?
A) When both tables are small
B) When the join keys are numeric and you need to join on a range condition (e.g., >= and <)
C) When you want to avoid any shuffle
D) When one table is a streaming source
Answer: B
Explanation: Range joins are designed for inequality predicates on numeric keys, unlike equality-based broadcast joins.
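
The range condition from Question 39 (start <= value < end) can be sketched in plain Python. This is an illustration of the general idea of bucketing both sides into bins so that each value is compared only against intervals in its own bin; the function name, the `bin_size` parameter, and the sample data are assumptions, not a Databricks API.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

Interval = Tuple[str, int, int]  # (name, start inclusive, end exclusive)
Point = Tuple[str, int]          # (id, value)


def range_join(points: List[Point], intervals: List[Interval], bin_size: int) -> List[Tuple[str, str]]:
    """Pair each point with every interval where start <= value < end."""
    # Index each interval under every bin it overlaps.
    bins: DefaultDict[int, List[Interval]] = defaultdict(list)
    for name, start, end in intervals:
        if end <= start:
            continue  # empty interval matches nothing
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            bins[b].append((name, start, end))

    # Each point only checks the intervals registered in its own bin.
    result: List[Tuple[str, str]] = []
    for pid, value in points:
        for name, start, end in bins.get(value // bin_size, []):
            if start <= value < end:
                result.append((pid, name))
    return result


# Hypothetical sample data: hours of the day mapped to named windows.
windows: List[Interval] = [("morning", 0, 12), ("afternoon", 12, 18), ("evening", 18, 24)]
events: List[Point] = [("e1", 3), ("e2", 12), ("e3", 23), ("e4", 30)]

matches = range_join(events, windows, bin_size=10)
```

Event `e4` falls outside every window and produces no row, as an inner join would. The bin size trades index size against the number of candidate comparisons per point.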
Explanation: Declarative pipelines describe sources, transforms, and sinks in a manifest; Lakeflow orchestrates them automatically.

Question 43. Which of the following is the most appropriate way to handle schema drift for JSON files ingested via Auto Loader?
A) Set cloudFiles.inferColumnTypes to false
B) Enable cloudFiles.schemaHints with a superset of expected fields
C) Use mergeSchema option on write
D) Manually edit the table schema after each ingestion
Answer: C
Explanation: mergeSchema allows Delta to evolve the table schema automatically as new fields appear.

Question 44. What does the spark.databricks.io.cache.enabled flag control?
A) Whether Delta tables are cached in memory
B) Whether remote files are cached locally on the driver node only
C) Whether the remote file cache (local SSD) is active for all workers
D) Whether the notebook UI caches results
Answer: C
Explanation: This flag enables the disk cache, which stores copies of remote data files on the local SSDs of each worker.

Question 45. Which of the following is a valid reason to use a foreachBatch sink instead of a built-in sink like writeStream.format("delta")?
A) To write each micro-batch to multiple destinations
B) To avoid checkpointing altogether
C) To guarantee exactly-once semantics without any additional code
D) To enable streaming reads from a Delta table
Answer: A
Explanation: foreachBatch gives full control over the batch, allowing writes to multiple tables, external systems, or conditional logic.

Question 46. Which of the following is NOT a supported file format for Delta Lake tables?
A) Parquet
B) ORC
C) Avro
D) CSV
Answer: D
Explanation: Delta Lake stores table data as Parquet files. CSV data can be read as a source, but it must be rewritten as Parquet to become part of a Delta table.

Question 47. When using Z-ordering, which column(s) should you prioritize?
A) Low-cardinality columns used in equality filters
B) High-cardinality columns frequently used in range predicates
C) Columns that are never filtered
D) Timestamp columns only
Answer: B
Explanation: Z-ordering clusters data on high-cardinality columns that appear in range or equality predicates, improving file pruning.

Question 48. Which of the following best explains “data skipping” in Delta Lake?
A) Deleting rows that do not meet quality standards
B) Ignoring files that do not contain data relevant to the query based on statistics and indexes
C) Skipping the execution of a stage in a DAG
D) Removing duplicate rows during a merge
B) Reduces cluster start-up latency by reusing pre-allocated instances
C) Enables automatic scaling of storage capacity
D) Provides built-in data encryption
Answer: B
Explanation: Instance pools keep a set of ready VMs, allowing clusters to acquire them quickly and reduce cold-start time.

Question 52. What is the primary purpose of the spark.databricks.delta.optimizeWrite.enabled setting?
A) To enable automatic file compaction during writes, reducing small file creation
B) To force all writes to use a single executor
C) To disable the Delta transaction log
D) To enable write-through caching
Answer: A
Explanation: When enabled, Delta will coalesce output files during write operations, mitigating the small file problem.

Question 53. Which of the following statements about the MERGE operation in Delta Lake is correct?
A) MERGE can only be used on external tables.
B) MERGE automatically resolves schema conflicts without any configuration.
C) MERGE supports conditional updates, inserts, and deletes based on matching criteria.
D) MERGE always creates a new version of the table regardless of the changes.
Answer: C
Explanation: MERGE allows you to specify WHEN MATCHED and WHEN NOT MATCHED clauses for fine-grained upserts and deletes.
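
The clause logic described in Question 53 can be modeled in a few lines of plain Python. This sketch only mimics the semantics of WHEN MATCHED (with a delete condition) and WHEN NOT MATCHED on in-memory dictionaries; the `merge` function, the `deleted` flag, and the sample rows are hypothetical and no Delta table is involved.

```python
from typing import Dict, List

Row = Dict[str, object]


def merge(target: Dict[int, Row], source: List[Row]) -> Dict[int, Row]:
    """Apply source rows to target, keyed on 'id', and return a new table."""
    merged = dict(target)  # copy, so the original target stays untouched
    for row in source:
        key = row["id"]
        if key in merged:
            if row.get("deleted"):
                # WHEN MATCHED AND source.deleted THEN DELETE
                del merged[key]
            else:
                # WHEN MATCHED THEN UPDATE
                merged[key] = {"id": key, "name": row["name"]}
        elif not row.get("deleted"):
            # WHEN NOT MATCHED THEN INSERT
            merged[key] = {"id": key, "name": row["name"]}
    return merged


# Hypothetical target table and incoming change set.
target_table: Dict[int, Row] = {
    1: {"id": 1, "name": "Ada"},
    2: {"id": 2, "name": "Bob"},
}
changes: List[Row] = [
    {"id": 1, "name": "Ada L.", "deleted": False},  # update
    {"id": 2, "name": "Bob", "deleted": True},      # delete
    {"id": 3, "name": "Cy", "deleted": False},      # insert
]

result_table = merge(target_table, changes)
```

Row 1 is updated, row 2 is removed, and row 3 is inserted, all in one pass over the change set, which is the combination of conditional actions that option C describes.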
Question 54. Which of these is a recommended way to reduce query cost when scanning a large Delta table with a selective filter on a high-cardinality column?
A) Increase the number of shuffle partitions
B) Run OPTIMIZE with ZORDER BY on the filtered column
C) Disable caching for the table
D) Use SELECT * and filter client-side
Answer: B
Explanation: Z-ordering, which is applied as part of OPTIMIZE, clusters data on the filtered column, enabling data skipping and reducing the amount of data read.

Question 55. In Databricks, which feature provides a unified view of tables, files, and ML models under a single governance framework?
A) Delta Sharing
B) Unity Catalog
C) Lakeflow
D) Databricks Asset Bundles
Answer: B
Explanation: Unity Catalog centralizes governance for tables, files (through volumes), and ML models.

Question 56. Which of the following is true about the spark.databricks.delta.autoCompact.enabled setting?
A) It automatically triggers VACUUM after each write.
B) It enables background compaction of small files without user intervention.
C) It disables the Delta transaction log.
D) It forces all writes to use a single large file.
Answer: B
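
Returning to the data skipping behind Question 54, the following plain-Python sketch shows why clustering helps. It models each file by its min/max statistics for one column and uses a simple sort as a stand-in for clustering; real Z-ordering interleaves several columns, so this is a simplification, and the function names and sample values are assumptions.

```python
from typing import Dict, List

FileStats = Dict[str, object]


def write_files(values: List[int], rows_per_file: int) -> List[FileStats]:
    """Split values into fixed-size 'files' and record min/max per file."""
    files: List[FileStats] = []
    for i in range(0, len(values), rows_per_file):
        chunk = values[i:i + rows_per_file]
        files.append({"rows": chunk, "min": min(chunk), "max": max(chunk)})
    return files


def files_to_read(files: List[FileStats], wanted: int) -> List[FileStats]:
    """Keep only files whose [min, max] range could contain the wanted value."""
    return [f for f in files if f["min"] <= wanted <= f["max"]]


# Hypothetical column values in arrival order.
column = [5, 91, 12, 77, 33, 58, 8, 64, 95, 21, 46, 83]

unclustered = write_files(column, rows_per_file=4)
clustered = write_files(sorted(column), rows_per_file=4)  # sort stands in for clustering

scanned_before = len(files_to_read(unclustered, 58))
scanned_after = len(files_to_read(clustered, 58))
```

In arrival order every file's range spans the wanted value, so nothing can be skipped; after clustering, only one of the three files has a range that includes it.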