































DATABRICKS CERTIFIED DATA ENGINEER ASSOCIATE 2026 EXAM: COMPLETE CURRENT TESTING QUESTIONS AND DETAILED CORRECT ANSWERS (VERIFIED)

Ace your Databricks Certified Data Engineer Associate exam by mastering data processing, ETL pipelines, and analytics on the Databricks Lakehouse Platform. This exam evaluates your ability to design and implement scalable, production-ready data solutions using Apache Spark and Delta Lake. It is specifically designed to validate your skills for professional roles in big data engineering and cloud analytics.
Typology: Exams

A data engineering team has created a series of tables using Parquet data stored in an external system. The team is noticing that after appending new rows to the data in the external system, their queries within Databricks are not returning the new rows. They identify the caching of the previous data as the cause of this issue. Which of the following approaches will ensure that the data returned by queries is always up-to-date?

A. The tables should be converted to the Delta format
B. The tables should be stored in a cloud-based external system
C. The tables should be refreshed in the writing cluster before the next query is run
D. The tables should be altered to include metadata to not cache
E. The tables should be updated before the next query is run

ANSWER: A. The tables should be converted to the Delta format

A table customerLocations exists with the following schema:

    id STRING, date STRING, city STRING, country STRING

A senior data engineer wants to create a new table from this table using the following command:

    CREATE TABLE customersPerCountry AS
    SELECT country, COUNT(*) AS customers
    FROM customerLocations
    GROUP BY country;

A junior data engineer asks why the schema is not being declared for the new table. Which of the following responses explains why declaring the schema is not necessary?
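As a rough illustration of the CREATE TABLE ... AS SELECT statement quoted above, the sketch below runs the same statement against an in-memory SQLite database. SQLite is only a stand-in here (an assumption for the sake of a runnable example, not Databricks or Delta Lake), and the sample rows are invented. It shows that the new table takes its column names from the query result, so no schema has to be declared by hand.

```python
import sqlite3


def ctas_columns_and_rows():
    """Run the CTAS statement on SQLite (a stand-in engine) and inspect the result."""
    conn = sqlite3.connect(":memory:")
    # Source table; TEXT is used instead of STRING because this is SQLite.
    conn.execute(
        "CREATE TABLE customerLocations (id TEXT, date TEXT, city TEXT, country TEXT)"
    )
    # Invented sample rows, for illustration only.
    conn.executemany(
        "INSERT INTO customerLocations VALUES (?, ?, ?, ?)",
        [
            ("1", "2022-01-01", "Paris", "FR"),
            ("2", "2022-01-02", "Lyon", "FR"),
            ("3", "2022-01-03", "Berlin", "DE"),
        ],
    )
    # No column list is declared: the columns come from the SELECT.
    conn.execute(
        "CREATE TABLE customersPerCountry AS "
        "SELECT country, COUNT(*) AS customers "
        "FROM customerLocations GROUP BY country"
    )
    # PRAGMA table_info yields (cid, name, type, ...); index 1 is the column name.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(customersPerCountry)")]
    rows = conn.execute(
        "SELECT country, customers FROM customersPerCountry ORDER BY country"
    ).fetchall()
    conn.close()
    return columns, rows


if __name__ == "__main__":
    print(ctas_columns_and_rows())
```

The column types an engine assigns can differ between systems, so the sketch only inspects the column names and the aggregated rows.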
B. Overwriting a table results in a clean table history for logging and audit purposes.
C. Overwriting a table maintains the old version of the table for Time Travel.
D. Overwriting a table is an atomic operation and will not leave the table in an unfinished state.
E. Overwriting a table allows for concurrent queries to be completed while in progress.

ANSWER: B. Overwriting a table results in a clean table history for logging and audit purposes.

Which of the following commands will return records from an existing Delta table my_table where duplicates have been removed?

A. DROP DUPLICATES FROM my_table;
B. SELECT * FROM my_table WHERE duplicate = False;
C. SELECT DISTINCT * FROM my_table;
D. MERGE INTO my_table a USING new_records b ON a.id = b.id WHEN NOT MATCHED THEN INSERT *;
E. MERGE INTO my_table a USING new_records b;

ANSWER: C. SELECT DISTINCT * FROM my_table;
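The effect of SELECT DISTINCT * in the deduplication question above can be tried out with a small sketch. It assumes an in-memory SQLite table as a stand-in for the Delta table, and the two-column layout (id, value) and the rows are invented for illustration. The query returns each distinct row once and leaves the stored table unchanged.

```python
import sqlite3


def distinct_demo():
    """Return (distinct rows, stored row count) for a table containing a duplicate."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical layout; the original question does not give my_table's schema.
    conn.execute("CREATE TABLE my_table (id INTEGER, value TEXT)")
    conn.executemany(
        "INSERT INTO my_table VALUES (?, ?)",
        [(1, "a"), (1, "a"), (2, "b")],  # the first two rows are exact duplicates
    )
    # DISTINCT only affects the query result, not the rows stored in the table.
    distinct_rows = sorted(conn.execute("SELECT DISTINCT * FROM my_table").fetchall())
    stored_count = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]
    conn.close()
    return distinct_rows, stored_count


if __name__ == "__main__":
    print(distinct_demo())
```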
A data engineer wants to horizontally combine two tables as a part of a query. They want to use a shared column as a key column, and they only want the query result to contain rows whose value in the key column is present in both tables. Which of the following SQL commands can they use to accomplish this task?

A. INNER JOIN
B. OUTER JOIN
C. LEFT JOIN
D. MERGE
E. UNION

ANSWER: A. INNER JOIN

A junior data engineer has ingested a JSON file into a table raw_table with the following schema:

    cart_id STRING, items ARRAY

The junior data engineer would like to unnest the items column in raw_table to result in a new table with the following schema:

    cart_id STRING, item_id STRING
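For the join question earlier on this page (answer: INNER JOIN), the following sketch may help to see which rows survive. It assumes SQLite as a stand-in engine, and the tables customers and orders, their columns, and their rows are all hypothetical. Only keys present in both tables appear in the result.

```python
import sqlite3


def inner_join_demo():
    """Join two hypothetical tables on a shared key and return the matching rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")
    # Customer 3 has no order; the order for customer 4 has no matching customer.
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bo"), (3, "Cy")]
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0), (4, 40.0)]
    )
    rows = conn.execute(
        "SELECT c.customer_id, c.name, o.total "
        "FROM customers c INNER JOIN orders o ON c.customer_id = o.customer_id "
        "ORDER BY c.customer_id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(inner_join_demo())
```

Keys 3 and 4 each exist in only one of the two tables, so neither shows up in the joined result.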
A. SELECT transaction_id, explode(payload) FROM raw_table;
B. SELECT transaction_id, payload.date FROM raw_table;
C. SELECT transaction_id, date FROM raw_table;
D. SELECT transaction_id, payload[date] FROM raw_table;
E. SELECT transaction_id, date from payload FROM raw_table;

ANSWER: B. SELECT transaction_id, payload.date FROM raw_table;

A data analyst has provided a data engineering team with the following Spark SQL query:

    SELECT district, avg(sales)
    FROM store_sales_20220101
    GROUP BY district;

The data analyst would like the data engineering team to run this query every day. The date at the end of the table name (20220101) should automatically be replaced with the current date each time the query is run. Which of the following approaches could be used by the data engineering team to efficiently automate this process?

A. They could wrap the query using PySpark and use Python's string variable system to automatically update the table name.
B. They could manually replace the date within the table name with the current day's date.
C. They could request that the data analyst rewrites the query to be run less frequently.
D. They could replace the string-formatted date in the table with a timestamp-formatted date.
E. They could pass the table int

ANSWER: A. They could wrap the query using PySpark and use Python's string variable system to automatically update the table name.

A data engineer has ingested data from an external source into a PySpark DataFrame raw_df. They need to briefly make this data available in SQL for a data analyst to perform a quality assurance check on the data. Which of the following commands should the data engineer run to make this data available in SQL for only the remainder of the Spark session?

A. raw_df.createOrReplaceTempView("raw_df")
B. raw_df.createTable("raw_df")
C. raw_df.write.save("raw_df")
D. raw_df.saveAsTable("raw_df")
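The approach named in the answer to the daily-query question above (Python string formatting around the SQL text) might look roughly like the sketch below. The function name build_daily_query is hypothetical, and the sketch only builds the query string; in a Databricks notebook the result would typically be passed to spark.sql, which is shown only as a comment because no Spark session is assumed here.

```python
from datetime import date


def build_daily_query(run_date=None):
    """Build the query text with the table-name suffix set to run_date (YYYYMMDD)."""
    # Default to today's date when no explicit date is supplied.
    run_date = run_date or date.today()
    # date objects accept strftime codes in format specs, giving e.g. 20220101.
    table_name = f"store_sales_{run_date:%Y%m%d}"
    return f"SELECT district, avg(sales) FROM {table_name} GROUP BY district"


if __name__ == "__main__":
    query = build_daily_query()
    print(query)
    # In a notebook with an active session this could be run as: spark.sql(query)
```

Passing an explicit date, for example build_daily_query(date(2022, 1, 1)), makes the function easy to check or to backfill for earlier days.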
    .read
    .schema(schema)
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load(dataSource)
)

The code block is returning an error. Which of the following changes should be made to the code block to configure the block to successfully perform a streaming read?

A. The .read line should be replaced with .readStream.
B. A new .stream line should be added after the .read line.
C. The .format("cloudFiles") line should be replaced with .format("stream").
D. A new .stream line should be added after the spark line.
E. A new .stream line should be added after the .load(dataSource) line.

ANSWER: A. The .read line should be replaced with .readStream.

A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table. The code block used by the data engineer is below:

(spark.table("sales")
    .withColumn("avg_price", col("sales") / col("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("complete")
    ._____
    .table("new_sales")
)

If the data engineer only wants the query to execute a single micro-batch to process all of the available data, which of the following lines of code should the data engineer use to fill in the blank?

A. trigger(once=True)
B. trigger(continuous="once")
C. processingTime("once")
D. trigger(processingTime="once")
E. processingTime(1)

ANSWER: A. trigger(once=True)
A. Data skipping
B. Z-Ordering
C. Bin-packing
D. Write as a Parquet file
E. Tuning the file size

ANSWER: B. Z-Ordering

A data engineer needs to create a database called customer360 at the location /customer/customer360. The data engineer is unsure if one of their colleagues has already created the database. Which of the following commands should the data engineer run to complete this task?

A. CREATE DATABASE customer360 LOCATION '/customer/customer360';
B. CREATE DATABASE IF NOT EXISTS customer360;
C. CREATE DATABASE IF NOT EXISTS customer360 LOCATION '/customer/customer360';
D. CREATE DATABASE IF NOT EXISTS customer360 DELTA LOCATION '/customer/customer360';
E. CREATE DATABASE customer360 DELTA LOCATION '/customer/customer360';

ANSWER: C. CREATE DATABASE IF NOT EXISTS customer360 LOCATION '/customer/customer360';

A junior data engineer needs to create a Spark SQL table my_table for which Spark manages both the data and the metadata. The metadata and data should also be stored in the Databricks Filesystem (DBFS). Which of the following commands should a senior data engineer share with the junior data engineer to complete this task?

A. CREATE TABLE my_table (id STRING, value STRING) USING org.apache.spark.sql.parquet OPTIONS (PATH "storage-path");
B. CREATE MANAGED TABLE my_table (id STRING, value STRING) USING org.apache.spark.sql.parquet OPTIONS (PATH "storage-path");
C. CREATE MANAGED TABLE my_table (id STRING, value STRING);
D. CREATE TABLE my_table (id STRING, value STRING) USING DBFS;
E. CREATE TABLE my_table (id STRING, value STRING);

ANSWER: E. CREATE TABLE my_table (id STRING, value STRING);
Assuming that schemaLocation and sourcePath have been set correctly, which of the following changes does the data engineer need to make to convert this code block to use Auto Loader to ingest the data?

A. The data engineer needs to change the format("cloudFiles") line to format("autoLoader").
B. There is no change required. Databricks automatically uses Auto Loader for streaming reads.
C. There is no change required. The inclusion of format("cloudFiles") enables the use of Auto Loader.
D. The dat

ANSWER: C. There is no change required. The inclusion of format("cloudFiles") enables the use of Auto Loader.

Which of the following data workloads will utilize a Bronze table as its source?

A. A job that aggregates cleaned data to create standard summary statistics
B. A job that queries aggregated data to publish key insights into a dashboard
C. A job that ingests raw data from a streaming source into the Lakehouse
D. A job that develops a feature set for a machine learning application
E. A job that enriches data by parsing its timestamps into a human-readable format

ANSWER: E. A job that enriches data by parsing its timestamps into a human-readable format

Which of the following data workloads will utilize a Silver table as its source?

A. A job that enriches data by parsing its timestamps into a human-readable format
B. A job that queries aggregated data that already feeds into a dashboard
C. A job that ingests raw data from a streaming source into the Lakehouse
D. A job that aggregates cleaned data to create standard summary statistics
E. A job that cleans data by removing malformatted records

ANSWER: D. A job that aggregates cleaned data to create standard summary statistics
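The two answers above describe a flow from raw (Bronze) data to enriched (Silver) data to aggregated (Gold) data. The sketch below mimics that flow with plain Python lists and dictionaries rather than Spark or Delta tables. The record layout (store, ts, sales) and the function names to_silver and to_gold are invented for illustration, so treat it as a mental model under those assumptions and not as a Databricks pipeline.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Bronze: raw records as ingested (epoch-second timestamps, sales still as strings).
bronze = [
    {"store": "north", "ts": 0, "sales": "10.0"},
    {"store": "north", "ts": 86400, "sales": "30.0"},
    {"store": "south", "ts": 86400, "sales": "5.0"},
]


def to_silver(records):
    """Bronze -> Silver: parse timestamps into a readable form and cast sales to float."""
    return [
        {
            "store": r["store"],
            "time": datetime.fromtimestamp(r["ts"], tz=timezone.utc).isoformat(),
            "sales": float(r["sales"]),
        }
        for r in records
    ]


def to_gold(records):
    """Silver -> Gold: aggregate the cleaned records into average sales per store."""
    by_store = defaultdict(list)
    for r in records:
        by_store[r["store"]].append(r["sales"])
    return {store: sum(values) / len(values) for store, values in by_store.items()}


if __name__ == "__main__":
    silver = to_silver(bronze)
    print(silver[0]["time"])
    print(to_gold(silver))
```

The enrichment step reads Bronze records and the aggregation step reads Silver records, which matches the sources named in the two answers.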
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("append")
    .table("cleanedSales")
)

D. (spark.readStream.load(rawSalesLocation)
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("append")
    .table("uncleanedSales")
)

E. (spark.read.load(rawSalesLocation)
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("

ANSWER: C.

(spark.table("sales")
    .withColumn("avgPrice", col("sales") / col("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("append")
    .table("cleanedSales")
)

Which of the following benefits does Delta Live Tables provide for ELT pipelines over standard data pipelines that utilize Spark and Delta Lake on Databricks?

A. The ability to declare and maintain data table dependencies
B. The ability to write pipelines in Python and/or SQL
C. The ability to access previous versions of data tables
D. The ability to automatically scale compute resources
E. The ability to perform batch and streaming queries

ANSWER: A. The ability to declare and maintain data table dependencies

A data engineer has three notebooks in an ELT pipeline. The notebooks need to be executed in a specific order for the pipeline to complete successfully. The data engineer would like to use Delta Live Tables to manage this process. Which of the following steps must the data engineer take as part of implementing this pipeline using Delta Live Tables?