


































DATABRICKS - DATA ENGINEER ASSOCIATE EXAM 1 2024/2025



































You were asked to create a table that can store the below data,
Note: Databricks also supports partitioning using generated columns.

The data engineering team noticed that one of the jobs fails randomly as a result of using spot instances. What feature in Jobs/Tasks can be used to address this issue so the job is more stable when using spot instances?
A. Use the Databricks REST API to monitor and restart the job
B. Use the Jobs runs, active runs UI section to monitor and restart the job
C. Add a second task and add a check condition to rerun the first task if it fails
D. Restart the job cluster, job automatically restarts
E. Add a retry policy to the task
- Precise Answer ✔✔E. Add a retry policy to the task
The answer is: add a retry policy to the task. Tasks in Jobs support a retry policy, which can be used to retry a failed task. This matters with spot instances, where it is common for executors or the driver to fail.

What is the main difference between AUTO LOADER and COPY INTO?
A. COPY INTO supports schema evolution.
B. AUTO LOADER supports schema evolution.
C. COPY INTO supports file notification when performing incremental loads.
D. AUTO LOADER supports reading data from Apache Kafka
E. AUTO LOADER supports file notification when performing incremental loads.
- Precise Answer ✔✔E. AUTO LOADER supports file notification when performing incremental loads.
Explanation
Auto Loader supports both directory listing and file notification, but COPY INTO only supports directory listing.
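To make the file notification point concrete, here is a minimal Python sketch of how Auto Loader's mode might be selected through stream options. The helper `build_autoloader_options` and both paths are hypothetical names used only for illustration; the option keys `cloudFiles.format`, `cloudFiles.schemaLocation` and `cloudFiles.useNotifications` are Auto Loader options, and the `spark` usage is left as a comment because it only runs on a cluster.

```python
from typing import Dict


def build_autoloader_options(
    file_format: str,
    schema_location: str,
    use_notifications: bool = False,
) -> Dict[str, str]:
    """Build an options dict for an Auto Loader ("cloudFiles") stream.

    Hypothetical helper for illustration; not part of any Databricks API.
    """
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        # "true" selects file notification mode; "false" keeps directory listing.
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }


options = build_autoloader_options(
    file_format="json",
    schema_location="/tmp/example/_schema",  # hypothetical path
    use_notifications=True,
)

# On a Databricks cluster, where `spark` is predefined, the dict could be used as:
# df = (spark.readStream.format("cloudFiles")
#         .options(**options)
#         .load("/tmp/example/landing"))  # hypothetical source path
print(options)
```

COPY INTO has no equivalent switch, which is the distinction the question is testing.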
A. Schema location is used to store user provided schema
B. Schema location is used to identify the schema of target table
C. AUTO LOADER does not require schema location, because it supports schema evolution
D. Schema location is used to store schema inferred by AUTO LOADER
E. Schema location is used to identify the schema of target table and source table
- Precise Answer ✔✔D. Schema location is used to store schema inferred by AUTO LOADER
Explanation
The answer is: schema location is used to store the schema inferred by AUTO LOADER, so that later runs start faster by reusing the last known schema instead of inferring it every time.
Auto Loader samples the first 50 GB or 1000 files that it discovers, whichever limit is crossed first. To avoid incurring this inference cost at every stream start up, and to be able to provide a stable schema across stream restarts, you must set the option cloudFiles.schemaLocation. Auto Loader creates a hidden directory _schemas at this location to track schema changes to the input data over time.
For detailed documentation on the available options, see: Auto Loader options | Databricks on AWS

Which of the following statements is incorrect about the lakehouse?
A. Support end-to-end streaming and batch workloads
B. Supports ACID
C. Support for diverse data types that can store both structured and unstructured
D. Supports BI and Machine learning
E. Storage is coupled with Compute
- Precise Answer ✔✔E. Storage is coupled with Compute
Explanation
The answer is: Storage is coupled with Compute. The question asks for the incorrect option. In a lakehouse, storage is decoupled from compute so both can scale independently.
See: What Is a Lakehouse? - The Databricks Blog

You are designing a data model that works for both machine learning using images and Batch ETL/ELT workloads. Which of the following features of data lakehouse can help you meet the needs of both workloads?
A. Data lakehouse requires very little data modeling.
B. Data lakehouse combines compute and storage for simple governance.
C. Data lakehouse provides autoscaling for compute clusters.
D. Data lakehouse can store unstructured data and support ACID transactions.
E. Data lakehouse fully exists in the cloud.
- Precise Answer ✔✔D. Data lakehouse can store unstructured data and support ACID transactions.
Explanation
The answer is: a data lakehouse stores unstructured data and is ACID-compliant.

Which of the following locations in Databricks product architecture hosts jobs/pipelines and queries?
b. The job cluster is best suited for this purpose.
c. Use Azure VM to read and write delta tables in Python
d. Use delta live table pipeline to run in continuous mode
- Precise Answer ✔✔b. The job cluster is best suited for this purpose.
Explanation
The answer is: the job cluster is best suited for this purpose. Since you don't need to interact with the notebook during execution, especially when it is a scheduled job, a job cluster makes sense. Using an all-purpose cluster can be twice as expensive as a job cluster.
Note that when a scheduled job runs with the option to create a new cluster, the cluster is terminated when the job completes. You cannot restart a job cluster.

Which of the following developer operations in CI/CD flow can be implemented in Databricks Repos?
a. Merge when code is committed
b. Pull request and review process
c. Trigger Databricks Repos API to pull the latest version of code into production folder
d. Resolve merge conflicts
e. Delete a branch
- Precise Answer ✔✔c. Trigger Databricks Repos API to pull the latest version of code into production folder
Explanation
See the below diagram to understand the roles that Databricks Repos and the Git provider play when building a CI/CD workflow. All the steps highlighted in yellow can be done in Databricks Repos; all the steps highlighted in gray are done in a Git provider like GitHub or Azure DevOps.

You are currently working with a second team, and both teams are looking to modify the same notebook. You noticed that a member of the second team is copying the notebook to a personal folder to edit it and then replacing the collaboration notebook. Which notebook feature do you recommend to make collaboration easier?
a. Databricks notebooks should be copied to a local machine and setup source control locally to version the notebooks
b. Databricks notebooks support automatic change tracking and versioning
c. Databricks Notebooks support real-time coauthoring on a single notebook
d. Databricks notebooks can be exported into dbc archive files and stored in data lake
e. Databricks notebook can be exported as HTML and imported at a later time
- Precise Answer ✔✔c. Databricks Notebooks support real-time coauthoring on a single notebook
Explanation
The answer is: Databricks Notebooks support real-time coauthoring on a single notebook. Every change is saved, and a notebook can be changed by multiple users.

You are currently working on a project that requires the use of SQL and Python in a given notebook, what would be your approach
Explanation
Delta Lake is
· Open source
· Built on a standard data format
· Optimized for cloud object storage
· Built for scalable metadata handling
Delta Lake is not
· Proprietary technology
· Storage format
· Storage medium
· Database service or data warehouse

You were asked to create or overwrite an existing delta table to store the below transaction data.
| transactionId | transactionDate | unitsSold |
| 1 | 01-01-2021 09:10:24 AM | 100 |
| 2 | 01-01-2021 10:20:24 PM | 10 |
a. CREATE OR REPLACE DELTA TABLE transactions ( transactionId int, transactionDate timestamp, unitsSold int)
b. CREATE OR REPLACE TABLE IF EXISTS transactions ( transactionId int, transactionDate timestamp, unitsSold int) FORMAT DELTA
c. CREATE IF EXISTS REPLACE TABLE transactions ( transactionId int, transactionDate timestamp, unitsSold int)
d. CREATE OR REPLACE TABLE transactions ( transactionId int, transactionDate timestamp, unitsSold int)
- Precise Answer ✔✔d. CREATE OR REPLACE TABLE transactions ( transactionId int, transactionDate timestamp, unitsSold int)
Explanation
The answer is
CREATE OR REPLACE TABLE transactions ( transactionId int, transactionDate timestamp, unitsSold int)
When creating a table in Databricks, by default the table is stored in DELTA format.

You noticed a colleague manually copying data to a backup folder before running an update command, so that the backup copy can replace the table if the update does not produce the expected outcome. Which Delta Lake feature would you recommend to simplify the process?
a. Use time travel feature to refer old data instead of manually copying
b. Use DEEP CLONE to clone the table prior to update to make a backup copy
c. Use SHADOW copy of the table as preferred backup choice
d. Cloud object storage retains previous version of the file
e. Cloud object storage automatically backups the data
- Precise Answer ✔✔a. Use time travel feature to refer old data instead of manually copying
Explanation
The answer is: use the time travel feature to refer to old data instead of manually copying.
https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html
SELECT count(*) FROM my_table TIMESTAMP AS OF "2019-01-01"
SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
SELECT count(*) FROM my_table TIMESTAMP AS OF "2019-01-01 01:30:00.000"

Which one of the following is not a Databricks lakehouse object?
a. Tables
b. Views
c. Database/Schemas
d. Catalog
e. Functions
f. Stored Procedures
- Precise Answer ✔✔f. Stored Procedures
Explanation
The answer is: Stored Procedures. Databricks lakehouse does not support stored procedures.

What type of table is created when you create delta table with below command?
CREATE TABLE transactions USING DELTA LOCATION "DBFS:/mnt/bronze/transactions"
a. Managed delta table
b. External table
c. Managed table
d. Temp table
e. Delta Lake table
- Precise Answer ✔✔b. External table
Explanation
Any time a table is created using the LOCATION keyword, it is considered an external table. Below is the current syntax.
Syntax
CREATE TABLE table_name ( column column_data_type...) USING format LOCATION "dbfs:/"
format -> DELTA, JSON, CSV, PARQUET, TEXT
Running the CREATE TABLE command from the above question creates an external table.

e. Temporary views are created in local_temp database
- Precise Answer ✔✔a. Temporary views are lost once the notebook is detached and re-attached
Explanation
The answer is: temporary views are lost once the notebook is detached and re-attached.
There are two types of temporary views that can be created: session scoped and global.
· A local/session scoped temporary view is only available within a Spark session, so another notebook in the same cluster cannot access it. If a notebook is detached and reattached, the local temporary view is lost.
· A global temporary view is available to all the notebooks in the cluster. If the cluster restarts, the global temporary view is lost.

Which of the following is correct for the global temporary view?
a. global temporary views cannot be accessed once the notebook is detached and attached
b. global temporary views can be accessed across many clusters
c. global temporary views can be still accessed even if the notebook is detached and attached
d. global temporary views can be still accessed even if the cluster is restarted
e. global temporary views are created in a database called temp database
- Precise Answer ✔✔c. global temporary views can be still accessed even if the notebook is detached and attached
Explanation
The answer is: global temporary views can still be accessed even if the notebook is detached and attached.
There are two types of temporary views that can be created: local and global.
· A local temporary view is only available within a Spark session, so another notebook in the same cluster cannot access it. If a notebook is detached and reattached, the local temporary view is lost.
· A global temporary view is available to all the notebooks in the cluster. It remains accessible even if the notebook is detached and reattached, but if the cluster is restarted, the global temporary view is lost.

You are currently working on reloading the customer_sales table using the below query
INSERT OVERWRITE customer_sales
SELECT * FROM customers c
INNER JOIN sales_monthly s on s.customer_id = c.customer_id
After you ran the above command, the Marketing team quickly wanted to review the old data that was in the table. How does INSERT OVERWRITE impact the data in the customer_sales table if you want to see the previous version of the data prior to running the above statement?
a. Overwrites the data in the table and all historical versions of the data, you cannot time travel to previous versions
b. Overwrites the data in the table but preserves all historical versions of the data, you can time travel to previous versions
c. Overwrites the current version of the data but clears all historical versions of the data, so you cannot time travel to previous versions.
Any DML/DDL operation (except DROP TABLE) on the Delta table preserves the historical version of the data.

Which of the following SQL statements can be used to query a table by eliminating duplicate rows from the query results?
a. SELECT DISTINCT * FROM table_name
b. SELECT DISTINCT * FROM table_name HAVING COUNT(*) > 1
c. SELECT DISTINCT_ROWS (*) FROM table_name
d. SELECT * FROM table_name GROUP BY * HAVING COUNT(*) < 1
e. SELECT * FROM table_name GROUP BY * HAVING COUNT(*) > 1
- Precise Answer ✔✔a. SELECT DISTINCT * FROM table_name

Which of the below SQL statements can be used to create a SQL UDF to convert Celsius to Fahrenheit and vice versa? You need to pass two parameters to this function: the first is the actual temperature, and the second is a single letter, F or C, that identifies whether it needs to be converted to Fahrenheit or Celsius.
select udf_convert(60,'C') will result in 15.
select udf_convert(10,'F') will result in 50
a. CREATE UDF FUNCTION udf_convert(temp DOUBLE, measure STRING) RETURNS DOUBLE RETURN CASE WHEN measure == 'F' then (temp * 9/5) + 32 ELSE (temp - 33 ) * 5/ END
b. CREATE UDF FUNCTION udf_convert(temp DOUBLE, measure STRING) RETURN CASE WHEN measure == 'F' then (temp * 9/5) + 32 ELSE (temp - 33 ) * 5/ END
c. CREATE FUNCTION udf_convert(temp DOUBLE, measure STRING) RETURN CASE WHEN measure == 'F' then (temp * 9/5) + 32 ELSE (temp - 33 ) * 5/ END