









































The Data Engineering Associate with Databricks Ultimate Exam is designed for professionals seeking expertise in data engineering workflows using Databricks technologies. The exam covers data pipelines, Apache Spark fundamentals, ETL processes, cloud data platforms, data transformation, SQL analytics, workflow automation, data governance, and big data processing techniques. It prepares candidates for modern data engineering roles focused on scalable analytics and cloud-based data solutions.
Typology: Exams










































Question 1. Which component of the Databricks Lakehouse architecture provides ACID transaction guarantees on cloud object storage?
A) Unity Catalog
B) Delta Lake
C) Databricks SQL Warehouse
D) Auto Loader
Answer: B
Explanation: Delta Lake adds a transaction log that enables ACID semantics on top of cloud object storage, ensuring consistency for reads and writes.

Question 2. In Unity Catalog, what is the correct order of the three-level namespace?
A) schema.table.catalog
B) catalog.table.schema
C) catalog.schema.table
D) table.schema.catalog
Answer: C
Explanation: Unity Catalog organizes objects as catalog → schema → table, mirroring traditional database hierarchies.

Question 3. Which type of Databricks cluster is optimized for scheduled, cost-effective batch jobs?
A) All-Purpose Cluster
B) Job Cluster
C) SQL Warehouse
D) Photon Cluster
Answer: B
Explanation: Job clusters are created on demand for a specific job, then terminated, reducing cost for batch workloads.

Question 4. What does the Control Plane of Databricks primarily manage?
A) Storage of user data files
B) Execution of Spark jobs
C) UI, authentication, and metastore services
D) Compute resources for notebooks
Answer: C
Explanation: The Control Plane hosts the UI, authentication, metastore, and other management services, while the Data Plane runs compute.

Question 5. Which Delta Lake feature enables you to query a previous version of a table?
A) Z-ordering
B) Time Travel
C) OPTIMIZE
D) VACUUM
Answer: B
Explanation: Time Travel uses the transaction log to access historical snapshots of a table.

Question 6. When using the COPY INTO command, what property makes the operation idempotent?
A) Automatic schema inference
B) File metadata tracking in the transaction log
C) Streaming source checkpointing
D) Auto-loader schema evolution
Answer: B
Explanation: COPY INTO records which files have been ingested in the Delta transaction log, preventing duplicate loads.

Question 7. Which ingestion method is best suited for high-volume, continuously arriving files in cloud storage?
A) COPY INTO

Answer: B
Explanation: A LEFT ANTI JOIN filters rows that do not have a match on the right side.

Question 11. Which higher-order function would you use to transform each element of an array column in Spark SQL?
A) filter
B) map
C) transform
D) explode
Answer: C
Explanation: transform(array, x -> expr) applies a transformation to every element, returning a new array.

Question 12. When processing nested JSON, which Spark SQL syntax extracts the field address.city from a column named payload?
A) payload.address.city
B) payload["address"]["city"]
C) payload.address["city"]
D) payload['address']['city']
Answer: A
Explanation: Dot notation (payload.address.city) navigates nested struct fields directly.

Question 13. In the Medallion Architecture, which layer typically contains data that has been cleaned, enriched, and conformed to business rules?
A) Bronze
B) Silver
C) Gold
D) Platinum
Answer: B
Explanation: The Silver layer stores refined data after initial cleaning and enrichment of the raw Bronze data.

Question 14. Which Structured Streaming trigger processes all data that is currently available, in one or more micro-batches, and then stops?
A) ProcessingTime('5 minutes')
B) AvailableNow
C) Continuous
D) Trigger.Once
Answer: B
Explanation: AvailableNow processes all data that is available when the query starts and then stops, which suits incremental, batch-style runs.

Question 15. What is the purpose of checkpointing in Structured Streaming?
A) Persisting final query results to a table
B) Storing the streaming query’s state to enable exactly-once processing after failures
C) Compressing output files for storage efficiency
D) Scheduling periodic job runs
Answer: B
Explanation: Checkpoints record offsets and state, allowing the stream to recover without data duplication.

Question 16. Which Delta Live Tables (DLT) construct defines a table that continuously processes streaming data?
A) CREATE LIVE TABLE … USING STREAMING
B) CREATE LIVE TABLE … AS SELECT … FROM STREAMING
C) CREATE LIVE TABLE … PIPELINE

Answer: B
Explanation: GRANT USAGE on the schema (USE SCHEMA in current Unity Catalog syntax) allows visibility, and GRANT SELECT provides read access to its tables.

Question 20. Which Unity Catalog feature enables row-level security based on the current user’s identity?
A) Column masking policies
B) Data masking rules
C) Row access policies (RLS)
D) Table ownership
Answer: C
Explanation: Row access policies evaluate expressions using CURRENT_USER() to filter rows per user.

Question 21. Which system table would you query to retrieve audit logs of catalog access events?
A) system.billing.usage
B) system.access.audit
C) system.jobs.history
D) system.delta.log
Answer: B
Explanation: system.access.audit contains records of access and permission changes in Unity Catalog.

Question 22. What does the OPTIMIZE command with ZORDER BY achieve on a Delta table?
A) Removes old files beyond the retention period
B) Compacts small files and co-locates data based on the specified columns for faster queries
C) Updates the table schema automatically
D) Generates a materialized view for the table
Answer: B
Explanation: OPTIMIZE rewrites data into larger files; ZORDER BY co-locates rows with related values of the specified columns in the same files, which improves data skipping.

Question 23. Which of the following statements about Delta Lake’s transaction log (_delta_log) is FALSE?
A) It stores JSON files that describe each commit.
B) It enables time travel by replaying log entries.
C) It is stored in the same cloud storage location as the table data.
D) It can be manually edited to modify table history.
Answer: D
Explanation: The transaction log is immutable; manual edits are not supported and would corrupt table integrity.

Question 24. In Databricks Repos, what is the primary benefit of linking a notebook to a Git branch?
A) Automatic scaling of clusters
B) Version control and collaborative development of notebook code
C) Real-time data streaming
D) Built-in data quality checks
Answer: B
Explanation: Repos integrate notebooks with Git, enabling commit, pull, and branch workflows.

Question 25. Which compute resource is specifically designed for ad-hoc BI queries in Databricks?
A) All-Purpose Cluster
B) Job Cluster
C) SQL Warehouse (formerly called SQL endpoint)
D) Photon Engine
Answer: C

Explanation: Setting cloudFiles.schemaEvolutionMode to addNewColumns lets Auto Loader add new columns to the schema as they appear in the source, without manual schema edits.

Question 29. In a Delta Live Tables pipeline, which statement correctly defines a “streaming” table?
A) CREATE LIVE TABLE my_table AS SELECT * FROM source_table
B) CREATE OR REFRESH STREAMING TABLE my_stream AS SELECT * FROM STREAM(LIVE.source)
C) CREATE LIVE TABLE my_stream FROM FILES
D) CREATE LIVE TABLE my_stream (STREAMING = TRUE) AS SELECT …
Answer: B
Explanation: The STREAMING keyword declares a streaming table, and STREAM() reads the upstream dataset incrementally.

Question 30. Which of the following is NOT a valid DLT expectation mode?
A) ALLOW
B) DROP
C) WARN
D) FAIL
Answer: A
Explanation: DLT documents three actions for invalid records: warn (the default, which keeps the record and logs the violation), drop, and fail. There is no ALLOW mode.

Question 31. What does the GRANT MODIFY privilege allow a user to do in Unity Catalog?
A) Read data from tables
B) Insert, update, or delete rows in a table
C) Change table schema definitions
D) Create new catalogs
Answer: B
Explanation: MODIFY grants write operations on table data but not schema changes.
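
The three expectation behaviors behind Question 30 (keep the invalid row and record the violation, drop the row, or fail the update) can be illustrated with a small plain-Python model. This is only a sketch of the semantics, not DLT code: the `Expectation` class, the `apply_expectation` function, and the `"retain"`/`"drop"`/`"fail"` labels are hypothetical names chosen for this example. In real pipelines the Python decorators `@dlt.expect`, `@dlt.expect_or_drop`, and `@dlt.expect_or_fail` play these roles.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


class ExpectationFailed(Exception):
    """Raised when a 'fail' expectation meets an invalid row."""


@dataclass
class Expectation:
    # All names here are hypothetical; they only model the three behaviors.
    name: str
    predicate: Callable[[Row], bool]
    on_violation: str = "retain"  # "retain", "drop", or "fail"
    violations: int = 0           # running count of invalid rows seen


def apply_expectation(rows: List[Row], exp: Expectation) -> List[Row]:
    """Return the rows that flow downstream after applying one expectation."""
    if exp.on_violation not in ("retain", "drop", "fail"):
        raise ValueError(f"unknown action: {exp.on_violation}")
    kept: List[Row] = []
    for row in rows:
        if exp.predicate(row):
            kept.append(row)
            continue
        exp.violations += 1
        if exp.on_violation == "fail":
            # Abort the whole update on the first invalid row.
            raise ExpectationFailed(f"{exp.name} violated by {row}")
        if exp.on_violation == "retain":
            # Keep the row; the violation is only counted.
            kept.append(row)
        # "drop": the row is simply not passed downstream.
    return kept


rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": -5}, {"id": 3, "amount": 7}]
drop_negative = Expectation("positive_amount", lambda r: r["amount"] > 0, "drop")
clean = apply_expectation(rows, drop_negative)
print([r["id"] for r in clean], drop_negative.violations)
```

With the `"drop"` action the row with the negative amount is filtered out, while the violation counter still records that one row failed the check.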
Question 32. Which Spark configuration enables Photon execution engine for faster query processing?
A) spark.databricks.photon.enabled = true
B) spark.sql.execution.photon.enabled = true
C) spark.databricks.replicateToPhoton = true
D) spark.sql.photon.optimizer = true
Answer: A
Explanation: Setting spark.databricks.photon.enabled activates the Photon engine on supported clusters.

Question 33. In Databricks, what is the purpose of the VACUUM command?
A) Compacts small files into larger ones.
B) Deletes files that are no longer referenced by the Delta table.
C) Optimizes query plans for joins.
D) Enforces column-level security.
Answer: B
Explanation: VACUUM removes obsolete files from storage based on the retention policy.

Question 34. Which of the following best explains “decoupling of storage and compute” in the Lakehouse model?
A) Data is stored in a proprietary format that ties to a specific cluster type.
B) Compute resources are provisioned on the same VM as storage.
C) Data resides in cloud object storage independent of the compute clusters that process it.
D) Storage and compute must be scaled together in lockstep.
Answer: C
Explanation: The Lakehouse stores data in cloud object storage (e.g., S3, ADLS) while compute clusters can be independently scaled.
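
The retention rule described in Question 33 can be sketched as a toy model in plain Python. This is not how Delta Lake implements VACUUM; it only illustrates the idea that a file becomes eligible for deletion once it is no longer referenced by the table and its removal is older than a retention threshold. The `DataFile` class, the `vacuum` function, and the file names are hypothetical, and the threshold is passed in explicitly as a parameter.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class DataFile:
    # Hypothetical model of one data file tracked by a table.
    path: str
    # Time at which a commit stopped referencing this file;
    # None means the current table version still uses it.
    removed_at: Optional[datetime] = None


def vacuum(files: List[DataFile], now: datetime, retention_hours: float) -> List[str]:
    """Return the paths that are unreferenced and older than the threshold."""
    cutoff = now - timedelta(hours=retention_hours)
    return sorted(
        f.path
        for f in files
        if f.removed_at is not None and f.removed_at < cutoff
    )


now = datetime(2024, 1, 10)
files = [
    DataFile("part-a.parquet"),                        # still referenced
    DataFile("part-b.parquet", datetime(2024, 1, 1)),  # removed 9 days ago
    DataFile("part-c.parquet", datetime(2024, 1, 9)),  # removed 1 day ago
]
print(vacuum(files, now, retention_hours=168))
```

With a 168-hour threshold only `part-b.parquet` qualifies. `part-c.parquet` is unreferenced but too recent, so older table versions that still need it remain readable through time travel.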

A) Micro-batch processing every few seconds.
B) Low-latency processing with end-to-end latency of about a millisecond.
C) One-time batch execution.
D) Automatic checkpoint removal.
Answer: B
Explanation: The Continuous trigger processes records as they arrive instead of in micro-batches, targeting end-to-end latency on the order of 1 ms.

Question 39. Which of the following is a correct way to create a Delta table using SQL?
A) CREATE TABLE my_table USING DELTA LOCATION '/mnt/data'
B) CREATE DELTA TABLE my_table AS SELECT * FROM source
C) CREATE TABLE my_table (id INT) USING DELTA;
D) CREATE TABLE my_table AS DELTA SELECT * FROM source
Answer: C
Explanation: The USING DELTA clause specifies the storage format, and the column list defines the schema of the new managed table.

Question 40. Which command registers a notebook as a job in Databricks Workflows?
A) CREATE PIPELINE … FROM NOTEBOOK
B) CREATE JOB my_job USING NOTEBOOK '/Path/To/Notebook'
C) REGISTER NOTEBOOK my_notebook AS JOB
D) ADD TASK my_task TO JOB my_job FROM NOTEBOOK
Answer: B
Explanation: Option B is the intended answer among these choices. In practice, Databricks does not expose job creation as a SQL statement; a notebook task is added to a job through the Workflows UI, the Jobs REST API, the Databricks CLI, or the SDKs.

Question 41. What is the default retention threshold that unreferenced data files must exceed before VACUUM can delete them?
A) 7 days
B) 30 days
C) 90 days
D) 1 day
Answer: A
Explanation: VACUUM only deletes unreferenced data files older than the retention threshold, which defaults to 7 days (delta.deletedFileRetentionDuration). The 30-day figure is the default retention of the transaction log history (delta.logRetentionDuration).

Question 42. Which Unity Catalog privilege allows a user to create new tables within a schema?
A) CREATE
B) USAGE
C) MODIFY
D) SELECT
Answer: A
Explanation: The CREATE privilege on a schema (CREATE TABLE in current Unity Catalog syntax) permits creation of tables in that schema.

Question 43. In Delta Live Tables, how do you reference a previously defined DLT table in a new pipeline step?
A) SELECT * FROM LIVE.my_table
B) FROM REF(my_table)
C) FROM TABLE(my_table)
D) USING my_table AS SOURCE
Answer: A
Explanation: Within a DLT pipeline, a dataset defined elsewhere in the same pipeline is referenced through the LIVE keyword, as in SELECT * FROM LIVE.my_table.

Question 44. Which of the following is NOT a valid trigger type for Databricks SQL Warehouse auto-scaling?
A) Auto-pause after inactivity
B) Auto-resume on query submission

Explanation: display() renders DataFrames as interactive tables and charts within notebooks.

Question 48. Which of the following statements about schema enforcement in Delta Lake is FALSE?
A) Inserts that do not match the table schema are rejected.
B) Schema enforcement can be disabled per table.
C) Schema enforcement applies only to streaming writes.
D) ALTER TABLE can be used to evolve the schema.
Answer: C
Explanation: Schema enforcement applies to both batch and streaming writes.

Question 49. Which Unity Catalog feature allows you to track data lineage across tables, views, and notebooks?
A) Data Access Auditing
B) Lineage Explorer in Catalog UI
C) system.access.audit table
D) Delta Log visualizer
Answer: B
Explanation: The Lineage Explorer visualizes how data moves through transformations.

Question 50. What is the effect of setting spark.databricks.delta.autoCompact.maxFileSize?
A) Limits the size of each file after auto-compact to the specified value.
B) Sets the maximum number of files that can be compacted at once.
C) Controls the maximum size of the transaction log.
D) Determines the maximum size of a Spark partition.
Answer: A
Explanation: This config caps the size of files generated during automatic compaction.
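
To make the schema enforcement behavior from Question 48 concrete, here is a minimal plain-Python sketch. It is a toy model under stated assumptions, not Delta Lake code: `append_rows`, `SchemaMismatch`, and the `merge_schema` flag are hypothetical names, with the flag loosely analogous to opting in to schema evolution on a write. The same check would apply whether rows arrive in one batch or in a series of streaming micro-batches.

```python
from typing import Dict, List

Schema = Dict[str, type]
Row = Dict[str, object]


class SchemaMismatch(Exception):
    """Raised when a write does not conform to the table schema."""


def append_rows(schema: Schema, table: List[Row], rows: List[Row],
                merge_schema: bool = False) -> Schema:
    """Validate every row first, then append all of them or none.

    Returns the (possibly evolved) schema. Hypothetical helper for illustration.
    """
    new_schema = dict(schema)
    for row in rows:
        for col, value in row.items():
            if col not in new_schema:
                if not merge_schema:
                    # Enforcement: unknown columns reject the whole write.
                    raise SchemaMismatch(f"unexpected column {col!r}")
                # Evolution: the new column is added to the schema.
                new_schema[col] = type(value)
            elif not isinstance(value, new_schema[col]):
                raise SchemaMismatch(
                    f"column {col!r} expects {new_schema[col].__name__}, "
                    f"got {type(value).__name__}"
                )
    table.extend(rows)  # reached only if every row passed validation
    return new_schema


schema: Schema = {"id": int, "name": str}
table: List[Row] = []
schema = append_rows(schema, table, [{"id": 1, "name": "a"}])
evolved = append_rows(schema, table, [{"id": 2, "name": "b", "email": "b@x.io"}],
                      merge_schema=True)
print(sorted(evolved), len(table))
```

Without `merge_schema=True`, the second write would raise `SchemaMismatch` and leave the table unchanged, which mirrors the all-or-nothing rejection described in option A.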
Question 51. Which of the following is a valid way to read a Delta table as a stream in PySpark?
A) spark.readStream.format("delta").load("/path/to/table")
B) spark.read.format("delta").load("/path/to/table")
C) spark.readStream.delta("/path/to/table")
D) spark.read.deltaStream("/path/to/table")
Answer: A
Explanation: readStream.format("delta") creates a streaming DataFrame from a Delta source.

Question 52. In Databricks SQL, which clause adds a comment to a column definition?
A) COMMENT 'text' AFTER column_name
B) column_name STRING COMMENT 'text'
C) ALTER COLUMN column_name SET COMMENT 'text'
D) COMMENT ON COLUMN table.column_name IS 'text'
Answer: B
Explanation: Column comments are specified inline after the data type.

Question 53. Which of the following best describes the purpose of a “Lakeflow Connect” connector?
A) It provides a UI for manual file uploads.
B) It offers managed, no-code ingestion from SaaS applications into Delta tables.
C) It enables direct JDBC connections to external databases.
D) It automates schema migration between Delta versions.
Answer: B
Explanation: Lakeflow Connect supplies pre-built connectors for SaaS sources, handling incremental loads.

Question 54. Which statement about Unity Catalog’s Service Principals is TRUE?

Question 57. In a DLT pipeline, which command would you use to reference a cloudFiles source for streaming ingestion?
A) READ STREAM FROM cloudFiles('/mnt')
B) CREATE LIVE TABLE raw USING CLOUDFILES('/mnt')
C) CREATE OR REFRESH STREAMING TABLE raw AS SELECT * FROM cloud_files('/mnt', 'json')
D) CREATE LIVE TABLE raw FROM FILES('/mnt')
Answer: C
Explanation: In DLT SQL, the cloud_files() function invokes Auto Loader as a streaming source for a streaming table; the 'json' format shown here is illustrative.

Question 58. Which of the following best explains the purpose of the spark.databricks.io.cache.enabled setting?
A) Caches Delta transaction logs locally on driver nodes.
B) Enables caching of remote data files on local disks to reduce latency.
C) Stores notebook outputs in a distributed cache.
D) Activates GPU acceleration for compute.
Answer: B
Explanation: This setting enables the disk cache, which keeps copies of remote files (e.g., S3 objects) on the workers' local disks for faster reads.

Question 59. Which of the following is NOT a supported data type for Delta Lake columns?
A) ARRAY
B) MAP
C) BINARY
D) GEOGRAPHY
Answer: D
Explanation: Delta Lake supports common types like ARRAY, MAP, and BINARY, but not a native GEOGRAPHY type.
Question 60. When you run DESCRIBE HISTORY, what information do you obtain?
A) List of columns and data types.
B) Detailed query execution plan.
C) Sequence of commits with timestamps, user, and operation type.
D) Current row count of the table.
Answer: C
Explanation: DESCRIBE HISTORY shows the Delta log’s commit history.

Question 61. Which of the following best describes “schema evolution” in Delta Lake?
A) Automatic deletion of columns that are no longer present in source data.
B) Ability to add new columns or change column types without rewriting the entire table.
C) Enforcing a strict, unchangeable schema at table creation.
D) Converting Parquet files to Delta format.
Answer: B
Explanation: Schema evolution allows the table schema to be altered (e.g., add columns) as new data arrives.

Question 62. Which Spark function can be used to explode an array column into multiple rows?
A) flatten()
B) explode()
C) split()
D) array_contains()
Answer: B
Explanation: explode(array_column) creates a new row for each element in the array.

Question 63. In Databricks, what is the primary purpose of the “Data Plane”?