





Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Data Engineering Associate with Databricks Questions with Correct Answers
Typology: Exams
1 / 9
This page cannot be seen from the preview
Don't miss anything!






What is done with the control plane? - answer web application notebooks jobs cluster management What is done with the data plane? - answer spinning up clusters data processing accessing data sources SQL: Create table with specific columns - answer CREATE TABLE students (id INT, name STRING) SQL: Insert data - answerINSERT INTO students VALUES (1, "Omar") Does ACID guarantee inserting many records in a single transaction? - answerYes SQL: Update record - answerUPDATE students SET value = value + 1 WHERE name LIKE "T%" // starts with T SQL: Delete record - answerDELETE FROM students WHERE value > 6 // Boolean is lowercase // equals is single = What is upsert? - answerAllows updates, inserts and other manipulations in a single command SQL: Merge for CDC - answerMERGE INTO students b USING updates u ON b.id=u.id WHEN MATCHED AND u.type = "update" THEN UPDATE SET * THEN INSERT *
Which metastore does databricks use by default? - answerHive SQL: See details about Delta table including number of files - answerDESCRIBE DETAIL students What does OPTIMIZE do? - answerReplace existing data files by combining files and rewriting results SQL: OPTIMIZE - answerOPTIMIZE students ZORDER BY id What are the configs for vacuuming? - answerSET spark.databricks.delta.retentionDurationCheck.enabled = false; SET spark.databricks.delta.vacuum.logging.enabled = true; What does ZORDER do? - answerspeeds up data retrieval when filling on provided fields by colocating data SQL: View history - answerDESCRIBE HISTORY students SQL: Time travel - answerSELECT * FROM students VERSION AS OF 3 SQL: Restore table - answerRESTORE TABLE students TO VERSION AS OF 8 What is the default time to vacuum files? - answer7 days SQL: Vacuum - answerVACUUM students RETAIN 0 HOURS What does DRY RUN do with VACUUM? - answerDisplays all files that will be deleted What does the LOCATION tag do? - answerIf you set it, you can define a location for the data to reside, otherwise it saves in /user/hive/warehouse/ SQL: How do you specify a command to run in a database - answerUSE dbname; CREATE OR REPLACE TABLE SQL: How do you describe the table? - answerDESCRIBE TABLE EXTENDED DESCRIBE EXTENDED
Dbtable User Password ) When should you use CTAS? - answerAutomatically infer schema Useful for external data ingestion from sources with well-defined schemas SQL: Generated columns - answerCREATE TABLE purchase_dates ( Id STRING transcation_timestamp STRING, Price STRING, Date DATE GENERATED ALWAYS AS ( Cast(cast(transaction_timestamp/1e6 AS TIMESTAMP) AS DATE)) Comment "generated based on trans_to columns" How can constraints be used? - answerDelta Lake enforces schema on write, Databricks can support standard SQL constraint management clauses to ensure the quality and integrity of data added to the table. Shown in table properties of describe extended SQL: Constraints - answerALTER TABLE purchase_dates ADD CONSTRAINT valid_date CHECK (date > '2020-01-01') What's the difference between DEEP CLONE and SHALLOW CLONE? - answerDEEP CLONE fully copies the data and metadata from source to target. It occurs incrementally, so executing this command again can sync changes. SHALLOW CLONE copies the Delta transaction logs, meaning the data doesn't move SQL: Clone - answerCREATE TABLE name DEEP CLONE source CREATE TABLE name SHALLOW CLONE source What are the writing to Delta functions? - answerOverwrite - OVERWRITE, INSERT OVERWRITE
Append - INSERT INTO
PySpark: Writestream - answerSpark.table().writestream.option(checkpoint location).outputmode().trigger.table() What are the SQL functions for timestamp and file name? - answerCurrent_timestamp() Input_file_name() When should you use Auto Loader? - answerFor incremental ingestion SQL: What is the syntax for a streaming table? - answerCREATE OR REFRESH STREAMING LIVE TABLE name What's the difference between a data lake house, data warehouse and data lake? - answerStore data in an open data lake and get quality performance and security. ACID transactions, schema enforcement and evolution Enables BI, AI How is work processed across nodes in a cluster? - answerDriver node (teacher) assigns the work to executors (workers) Workers perform tasks. What is the limitation with Databricks Notebooks version vs Repos? - answerNotebooks version live with the notebook and are lost when the notebook is deleted or exported. What's the difference between a View and CTE? - answerView can be temporary or persist, can be accessed across multiple SQL statements. CTE can be accessed within a SQL statement What are the general concepts, triggers and watermarks for Structured Streaming? - answerWatermarks - dropping of old state data Triggers - time windows Express computations on streaming data What is the Auto Loader used for? - answerStreaming reads What are the benefits to Delta Live Tables? - answerSimple declarative approach for building data pipelines Manage structure and equality Simplify batch and streaming What are the benefits to Unity Catalog? - answerUnified governance across clouds Data and AI assets What are the Table ACL permissions? - answerCatalog, database, table, view, function, any file
What are the Table ACLs roles? - answerDatabricks admin, catalog owner, database owner, table owner What are the Table ACL privileges? - answerALL PRIVILEGES, SELECT, MODIFY, READ_METADATA, USAGE, CREATE How do you set permissions with Databricks SQL? - answerGRANT USAGE DATABASE name USER name What are examples of narrow transformations? - answerFilter, contains, map because they only operate on a single partition Examples of trigger parameter - answerSpark.table().writestream.optin(checkpoint).outputmode(complete).trigger(availa bleNow=True).table().awaittermination() What are the differences with the trigger parameter? - answerProcessingTime - user intervals Once AvailableNow - process all data available then stop What is auto loader preferred for? - answerReading data off of cloud storage What is the target field in DLT? - answerDatabase where the result tables will be published SQL: How to start a DLT? - answerCREATE INCREMENTAL LIVE TABLE Cloud_files() SQL: What do you add to tables when using a DLT? - answerLIVE LIVE.table FROM STREAM() Which of the following describes a benefit of a data lakehouse that is unavailable in a traditional data warehouse? - answerA data lakehouse enables both batch and streaming analytics. A data engineer has created a Delta table as part of a data pipeline. Downstream data analysts now need SELECT permission on the Delta table. Assuming the data engineer is the Delta table owner, which part of the Databricks Lakehouse Platform can the data engineer use to grant the data analysts the appropriate access? - answerdata explorer