






Databricks Data Engineering Associate-KH. Questions and Answers
Typology: Exams







Question 1
Which of the following describes a benefit of a data lakehouse that is unavailable in a traditional data warehouse?
A. A data lakehouse provides a relational system of data management.
B. A data lakehouse captures snapshots of data for version control purposes.
C. A data lakehouse couples storage and compute for complete control.
D. A data lakehouse utilizes proprietary storage formats for data.
E. A data lakehouse enables both batch and streaming analytics.
Answer: E. A data lakehouse enables both batch and streaming analytics.

Question 2
Which of the following locations hosts the driver and worker nodes of a Databricks-managed cluster?
A. Data plane
B. Control plane
C. Databricks Filesystem
D. JDBC data source
E. Databricks web application
Answer: A. Data plane

Question 3
A data architect is designing a data model that works for both video-based machine learning workloads and highly audited batch ETL/ELT workloads. Which of the following describes how using a data lakehouse can help the data architect meet the needs of both workloads?
A. A data lakehouse requires very little data modeling.
B. A data lakehouse combines compute and storage for simple governance.
C. A data lakehouse provides autoscaling for compute clusters.
D. A data lakehouse stores unstructured data and is ACID-compliant.
Answer: D. A data lakehouse stores unstructured data and is ACID-compliant.

Question 4
Which of the following describes a scenario in which a data engineer will want to use a Job cluster instead of an all-purpose cluster?
A. An ad-hoc analytics report needs to be developed while minimizing compute costs.
B. A data team needs to collaborate on the development of a machine learning model.
C. An automated workflow needs to be run every 30 minutes.
D. A Databricks SQL query needs to be scheduled for upward reporting.
E. A data engineer needs to manually investigate a produc...
Answer: C. An automated workflow needs to be run every 30 minutes.

Question 5
A data engineer has created a Delta table as part of a data pipeline. Downstream data analysts now need SELECT permission on the Delta table. Assuming the data engineer is the Delta table owner, which part of the Databricks Lakehouse Platform can the data engineer use to grant the data analysts the appropriate access?
A. Repos
B. Jobs
C. Data Explorer
D. Databricks Filesystem
E. Dashboards
Answer: C. Data Explorer

Question 6
Two junior data engineers are authoring separate parts of a single data pipeline notebook. They are working on separate Git branches so they can pair program on the same notebook simultaneously. A senior data engineer experienced in Databricks suggests there is a better alternative for this type of collaboration. Which of the following supports the senior data engineer's claim?
A. Databricks Notebooks support automatic change-tracking and versioning
B. Databricks Notebooks support real-time coauthoring on a single notebook
Answer: B. Databricks Notebooks support real-time coauthoring on a single notebook

Question 7
Which of the following describes how Databricks Repos can help facilitate CI/CD workflows on the Databricks Lakehouse Platform?
A. Databricks Repos can facilitate the pull request, review, and approval process before merging branches
B. Databricks Repos can merge changes from a secondary Git branch into a main Git branch
C. Databricks Repos can be used to design, develop, and trigger Git automation pipelines
D. Databricks Repos can store the single-source-of-truth Git repository
E. Databricks Repos can commit or push code changes to trigger a CI/CD process
Answer: E. Databricks Repos can commit or push code changes to trigger a CI/CD process

Question 8
Which of the following statements describes Delta Lake?
A. Delta Lake is an open source analytics engine used for big data workloads.
B. Delta Lake is an open format storage layer that delivers reliability, security, and performance.
C. Delta Lake is an open source platform to help manage the complete machine learning lifecycle.

C. B...
Answer: B. Z-Ordering

Question 12
A data engineer needs to create a database called customer360 at the location /customer/customer360. The data engineer is unsure if one of their colleagues has already created the database. Which of the following commands should the data engineer run to complete this task?
A. CREATE DATABASE customer360 LOCATION '/customer/customer360';
B. CREATE DATABASE IF NOT EXISTS customer360;
C. CREATE DATABASE IF NOT EXISTS customer360 LOCATION '/customer/customer360';
D. CREATE DATABASE IF NO...
Answer: C. CREATE DATABASE IF NOT EXISTS customer360 LOCATION '/customer/customer360';

Question 13
A junior data engineer needs to create a Spark SQL table my_table for which Spark manages both the data and the metadata. The metadata and data should also be stored in the Databricks Filesystem (DBFS). Which of the following commands should a senior data engineer share with the junior data engineer to complete this task?
A. CREATE TABLE my_table (id STRING, value STRING) USING org.apache.spark.sql.parquet OPTIONS (PATH "storage-path");
B. CREATE MANAGED TABLE my_table (id STRING, va...
Answer: E. CREATE TABLE my_table (id STRING, value STRING);

Question 14
A data engineer wants to create a relational object by pulling data from two tables. The relational object must be used by other data engineers in other sessions. In order to save on storage costs, the data engineer wants to avoid copying and storing physical data. Which of the following relational objects should the data engineer create?
A. View
B. Temporary view
C. Delta Table
D. Database
E. Spark SQL Table
Answer: A. View

Question 15
A data engineering team has created a series of tables using Parquet data stored in an external system. The team is noticing that after appending new rows to the data in the external system, their queries within Databricks are not returning the new rows. They identify the caching of the previous data as the cause of this issue. Which of the following approaches will ensure that the data returned by queries is always up-to-date?
A. The tables should be converted to the Delta format
B. ...
Answer: A. The tables should be converted to the Delta format

Question 16
A table customerLocations exists with the following schema:
    id STRING, date STRING, city STRING, country STRING
A senior data engineer wants to create a new table from this table using the following command:
    CREATE TABLE customersPerCountry AS
    SELECT country, COUNT(*) AS customers
    FROM customerLocations
    GROUP BY country;
A junior data engineer asks why the schema is not being declared for the new table. Which of the following responses explains why declaring the schema is not neces...
Answer: A. CREATE TABLE AS SELECT statements adopt schema details from the source table and query.

Question 17
A data engineer is overwriting data in a table by deleting the table and recreating the table. Another data engineer suggests that this is inefficient and the table should simply be overwritten instead. Which of the following reasons to overwrite the table instead of deleting and recreating the table is incorrect?
A. Overwriting a table is efficient because no files need to be deleted.
B. Overwriting a table results in a clean table history for logging and audit purposes.
C. Overwri...
Answer: B. Overwriting a table results in a clean table history for logging and audit purposes.

Question 18
Which of the following commands will return records from an existing Delta table my_table where duplicates have been removed?
A. DROP DUPLICATES FROM my_table;
B. SELECT * FROM my_table WHERE duplicate = False;
C. SELECT DISTINCT * FROM my_table;
D. MERGE INTO my_table a USING new_records b ON a.id = b.id WHEN NOT MATCHED THEN INSERT *;
E. MERGE INTO my_table a USING new_records b;
Answer: C. SELECT DISTINCT * FROM my_table;

Question 19
A data engineer wants to horizontally combine two tables as a part of a query. They want to use a shared column as a key column, and they only want the query result to contain rows whose value in the key column is present in both tables.

Question 23
A data engineer has ingested data from an external source into a PySpark DataFrame raw_df. They need to briefly make this data available in SQL for a data analyst to perform a quality assurance check on the data. Which of the following commands should the data engineer run to make this data available in SQL for only the remainder of the Spark session?
A. raw_df.createOrReplaceTempView("raw_df")
B. raw_df.createTable("raw_df")
C. raw_df.write.save("raw_df")
D. raw_df.saveAsTable("raw_...
Answer: A. raw_df.createOrReplaceTempView("raw_df")

Question 24
A data engineer needs to dynamically create a table name string using three Python variables: region, store, and year. An example of a table name is below when region = "nyc", store = "100", and year = "2021":
    nyc100_sales_2021
Which of the following commands should the data engineer use to construct the table name in Python?
A. "{region}+{store}+_sales_+{year}"
B. f"{region}+{store}+_sales_+{year}"
C. "{region}{store}_sales_{year}"
D. f"{region}{store}_sales_{year}"
E. {region}+{sto...
Answer: D. f"{region}{store}_sales_{year}"

Question 25
A data engineer has developed a code block to perform a streaming read on a data source. The code block is below:
    (spark
      .read
      .schema(schema)
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load(dataSource)
    )
The code block is returning an error. Which of the following changes should be made to the code block to configure the block to successfully perform a streaming read?
A. The .read line should be replaced with .readStream.
B. A new .stream line should be added after...
Answer: A. The .read line should be replaced with .readStream.

Question 26
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table. The code block used by the data engineer is below:
    (spark.table("sales")
      .withColumn("avg_price", col("sales") / col("units"))
      .writeStream
      .option("checkpointLocation", checkpointPath)
      .outputMode("complete")
      ._____
      .table("new_sales")
    )
If the data engineer only wants the query to execute a single micro-batch to process all of the...
Answer: A. trigger(once=True)

Question 27
A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The data engineer needs to identify which files are new since the previous run in the pipeline, and set up the pipeline to only ingest those new files with each run. Which of the following tools can the data engineer use to solve this problem?
A. Databricks SQL
B. D...
Answer: E. Auto Loader

Question 28
A data engineering team is in the process of converting their existing data pipeline to utilize Auto Loader for incremental processing in the ingestion of JSON files. One data engineer comes across the following code block in the Auto Loader documentation:
    (streaming_df = spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", schemaLocation)
      .load(sourcePath))
Assuming that schemaLocation and sourcePath have been set correctly, ...
Answer: C. There is no change required. The inclusion of format("cloudFiles") enables the use of Auto Loader.

Question 29
Which of the following data workloads will utilize a Bronze table as its source?
A. A job that aggregates cleaned data to create standard summary statistics
B. A job that queries aggregated data to publish key insights into a dashboard
C. A job that ingests raw data from a streaming source into the Lakehouse
D. A job that develops a feature set for a machine learning application
E. A job that enriches data by parsing its timestamps into a human-readable format
Answer: E. A job that enriches data by parsing its timestamps into a human-readable format

Question 30
Which of the following data workloads will utilize a Silver table as its source?
A. A job that enriches data by parsing its timestamps into a human-readable format
B. A job that queries aggregated data that already feeds into a dashboard
C. A job that ingests raw data from a streaming source into the Lakehouse
D. A job that aggregates cleaned data to create standard summary statistics
E. A job that cleans data by removing malformatted records
Answer: D. A job that aggregates cleaned data to create standard summary statistics
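
The Bronze and Silver questions above turn on which layer a job reads from. The sketch below is only an illustration of that idea, not Databricks code: plain Python lists stand in for tables, and the sample records and the helper names to_silver and to_gold are hypothetical. The enrichment job reads Bronze (raw) data and writes Silver; the aggregation job reads Silver and writes Gold.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Bronze: raw records exactly as ingested (hypothetical sample data).
bronze = [
    {"store": "nyc", "ts": 1609459200, "amount": "10.5"},
    {"store": "nyc", "ts": 1609545600, "amount": "4.5"},
    {"store": "sf", "ts": 1609459200, "amount": "7.0"},
    {"store": "sf", "ts": None, "amount": "oops"},  # malformed record
]


def to_silver(rows):
    """Bronze -> Silver: drop malformed rows and parse timestamps and amounts."""
    cleaned = []
    for row in rows:
        try:
            ts = datetime.fromtimestamp(row["ts"], tz=timezone.utc)
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue  # skip records that cannot be parsed
        cleaned.append(
            {"store": row["store"], "date": ts.date().isoformat(), "amount": amount}
        )
    return cleaned


def to_gold(rows):
    """Silver -> Gold: aggregate cleaned rows into per-store totals."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["store"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)  # source is the Bronze layer
gold = to_gold(silver)      # source is the Silver layer
print(silver)
print(gold)
```

In this sketch, timestamp parsing takes the Bronze list as its input (the Question 29 case) and the summary statistics take the Silver list as their input (the Question 30 case).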

FROM json.`/path/to/json/file.json`;
The data engineer asks a colleague for help to convert this query for use in a Delta Live Tables (DLT) pipeline. The query should create the first table in the DLT pipeline. Which of the following describes the change the colleague needs to make to the query?
A. They need to add a COMMENT line at the beginning of the query.
B. They need to add a CREATE LIVE TABLE table_name AS line at the beginning of the query.
Answer: B. They need to add a CREATE LIVE TABLE table_name AS line at the beginning of the query.

Question 35
A dataset has been defined using Delta Live Tables and includes an expectations clause:
    CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01')
What is the expected behavior when a batch of data containing data that violates these constraints is processed?
A. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
B. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event...
Answer: A. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.

Question 36
A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE. The table is configured to run in Development mode using the Triggered Pipeline Mode. Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?
A. All datasets will be updated once and the pipeline will shut down. The compute resourc...
Answer: D. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.

Question 37
A data engineer has a Job with multiple tasks that runs nightly. One of the tasks unexpectedly fails during 10 percent of the runs. Which of the following actions can the data engineer perform to ensure the Job completes each night while minimizing compute costs?
A. They can institute a retry policy for the entire Job
B. They can observe the task as it runs to try and determine why it is failing
C. They can set up the Job to run multiple times ensuring that at least one will complete

Job, and it starts at 12:30 AM. Sometimes, the second Job fails when the first Job does not complete by 12:30 AM. Which of the following approaches can the data engineer use to avoid this problem?
A. They can utilize multiple tasks in a single job with a linear dependency
B. They can use cluster pools to help th...
Answer: A. They can utilize multiple tasks in a single job with a linear dependency

Question 39
A data engineer has set up a notebook to automatically process using a Job. The data engineer's manager wants to version control the schedule due to its complexity. Which of the following approaches can the data engineer use to obtain a version-controllable configuration of the Job's schedule?
A. They can link the Job to notebooks that are a part of a Databricks Repo.
B. They can submit the Job once on a Job cluster.
C. They can download the JSON description of the Job from the Job's page.
Answer: C. They can download the JSON description of the Job from the Job's page.

Question 40
A data analyst has noticed that their Databricks SQL queries are running too slowly. They claim that this issue is affecting all of their sequentially run queries. They ask the data engineering team for help. The data engineering team notices that each of the queries uses the same SQL endpoint, but the SQL endpoint is not used by any other user. Which of the following approaches can the data engineering team use to improve the latency of the data analyst's queries?
A. They can turn o...
Answer: C. They can increase the cluster size of the SQL endpoint.

Question 41
An engineering manager uses a Databricks SQL query to monitor their team's progress on fixes related to customer-reported bugs. The manager checks the results of the query every day, but they are manually rerunning the query each day and waiting for the results. Which of the following approaches can the manager use to ensure the results of the query are updated each day?
A. They can schedule the query to run every 1 day from the Jobs UI.
B. They can schedule the query to refresh every 1 day from the query's page in Databricks SQL.
Answer: B. They can schedule the query to refresh every 1 day from the query's page in Databricks SQL.

Question 42
A data engineering team has been using a Databricks SQL query to monitor the performance of an ELT job. The ELT job is triggered by a specific number of input records being ready to process. The Databricks SQL query returns the number of minutes since the job's most recent runtime. Which of the following approaches can enable the data engineering team to be notified if the ELT job has not been run in an hour?