Prepare for your exams
Get points
Guidelines and tips
Sell on Docsity
Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Docsity AI

Log in Sign up

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search for your university

Find the specific documents for your university's exams

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

Databricks Associate Questions with Solutions, Exams of Advanced Education

Miami University - Oxford Advanced Education

Databricks Associate Questions with Solutions

Typology: Exams

2025/2026

Available from 06/01/2026

kamothojj 🇺🇸

5

(2)

16K documents

1 / 27

This page cannot be seen from the preview

Don't miss anything!

Databricks Associate

Bronze - answer Typically just a raw copy of ingested data

Replaces traditional data lake

Provides efficient storage and querying of full, unprocessed history of data

Silver - answer Filtered, Cleaned, Augmented

Reduces data storage complexity, latency, and redundancy Optimizes ETL throughput

and analytic query performance Preserves grain of original data (without aggregations)

Eliminates duplicate records Production schema enforced Data quality checks, corrupt

data quarantined

Gold - answerPowers ML applications, reporting, dashboards, ad hoc analytics

Refined views of data, typically with aggregations Reduces strain on production

systems

Optimizes query performance for business-critical data

What is a Live Table? - answerLive Tables are materialized views for the lakehouse

Defined by a SQL query • Created and kept up-to-date by a pipeline

Live Table Create - answerCREATE OR REPLACE/REFRSH LIVE TABLE report AS

SELECT sum(profit) FROM prod.sales

What is a Streaming Live Table? - answerStreaming Live tables compute results over

append-only streams such as Kafka, Kinesis, or Auto Loader (files on cloud storage)

• Streaming live tables allow you to reduce costs and latency by avoiding reprocessing

of old data

A streaming live table is "stateful": • Ensures exactly-once processing of input rows •

Inputs are only read once

Streaming Live Table Syntax - answerCREATE STREAMING LIVE TABLE report AS

SELECT sum(profit) FROM cloud_files(prod.sales)

Development Mode (DLT) - answerReuses a long-running cluster running for fast

iteration. • No retries on errors enabling faster debugging.

Production Mode (DLT) - answerCuts costs by turning off clusters as soon as they are

done (within 5 minutes) • Escalating retries, including cluster restarts, ensure reliability

in the face of transient issues.

Discover Exams of Advanced Education Miami University - Oxford

Partial preview of the text

Download Databricks Associate Questions with Solutions and more Exams Advanced Education in PDF only on Docsity!

Databricks Associate

Bronze - answer Typically just a raw copy of ingested data Replaces traditional data lake Provides efficient storage and querying of full, unprocessed history of data Silver - answer Filtered, Cleaned, Augmented Reduces data storage complexity, latency, and redundancy Optimizes ETL throughput and analytic query performance Preserves grain of original data (without aggregations) Eliminates duplicate records Production schema enforced Data quality checks, corrupt data quarantined Gold - answerPowers ML applications, reporting, dashboards, ad hoc analytics Refined views of data, typically with aggregations Reduces strain on production systems Optimizes query performance for business-critical data What is a Live Table? - answerLive Tables are materialized views for the lakehouse Defined by a SQL query • Created and kept up-to-date by a pipeline Live Table Create - answerCREATE OR REPLACE/REFRSH LIVE TABLE report AS SELECT sum(profit) FROM prod.sales What is a Streaming Live Table? - answerStreaming Live tables compute results over append-only streams such as Kafka, Kinesis, or Auto Loader (files on cloud storage)

Streaming live tables allow you to reduce costs and latency by avoiding reprocessing of old data A streaming live table is "stateful": • Ensures exactly-once processing of input rows • Inputs are only read once Streaming Live Table Syntax - answerCREATE STREAMING LIVE TABLE report AS SELECT sum(profit) FROM cloud_files(prod.sales) Development Mode (DLT) - answerReuses a long-running cluster running for fast iteration. • No retries on errors enabling faster debugging. Production Mode (DLT) - answerCuts costs by turning off clusters as soon as they are done (within 5 minutes) • Escalating retries, including cluster restarts, ensure reliability in the face of transient issues.

Declare LIVE Dependencies (DLT) - answer• Dependencies owned by other producers are just read from the catalog or spark data source as normal.

LIVE dependencies, from the same pipeline, are read from the LIVE schema.
DLT detects LIVE dependencies and executes all operations in correct order.
DLT handles parallelism and captures the lineage of the data Live Table Dependencies (DLT) - answerCREATE LIVE TABLE events AS SELECT ... FROM prod.raw_data CREATE LIVE TABLE report AS SELECT ... FROM LIVE.events Ensure correctness with Expectations (DLT) - answerExpectations are true/false expressions that are used to validate each row during processing. DLT offers flexible policies on how to handle records that violate expectations:
Track number of bad records
Drop bad records • Abort processing for a single bad record Expectations (SQL) - answerCONSTRAINT valid_timestamp EXPECT (timestamp > 'ù÷øù-÷ø-÷ø') ON VIOLATION DROP Expectations (Python) - [email protected]_or_drop( "valid_timestamp", col("timestamp") > 'ù÷øù-÷ø-÷ø') Pipelines UI - answerVisualize data flows between tables A one stop shop for ETL debugging and operations Discover metadata and quality of each table Access to historical updates Control operations Dive deep into events Event Log (DLT) - Operational Statistics - answerTime and current status, for all operations Pipeline and cluster configurations Row counts Event Log (DLT) - Provenance - answerTable schemas, definitions, and declared properties Table-level lineage Query plans used to update tables Event Log (DLT) - Data Quality - answerExpectation pass / failure / drop statistics Input/Output rows that caused expectation failures

Which of the approache can the data engineer take to accomplish this? Select one response. - answerThe data engineer can query events of type user_action from the event log. A data engineer wants to query metrics on the latest update made to their pipeline. They need to be able to see the event type and timestamp for each update. - answerThe data engineer can query the update ID from the events log where the event type is create_update. A data engineer wants to query metrics on the latest update made to their pipeline. The pipeline has multiple data sources. Despite the input data sources having low data retention, the data engineer needs to retain the results of the query indefinitely. - answerStreaming live table; streaming live tables can preserve data indefinitely. ) Which of the following correctly describes how to access contents of the table directory?

answerThe contents of the table directory can be viewed through the metastore. Which of the following data quality metrics are captured through row_epectations in a pipeline's event log? - answerName Dataset Failed Records Python vs SQL (DLT) API - answerPython API Proprietary SQL API Python vs SQL (DLT) Syntax Checks - answerPython - no dlt module, no syntax check SQL - has syntax checks In Python, if you run a DLT notebook cell on its own it will show in error, whereas in SQL it will check if the command is syntactically valid and tell you. In both cases, individual notebook cells are not supposed to be run for DLT pipelines. Python vs SQL (DLT) IMports - answerThe dlt module should be explicitly imported into your Python notebook libraries. In SQL, this is not the case. Python vs SQL (DLT) Tables - answerPython - Tables as DataFrames SQL - Tables as Query Results The Python DataFrame API allows for multiple transformations of a dataset by stringing multiple API calls together. Compared to SQL, those same transformations must be saved in temporary tables as they are transformed. Python vs SQL (DLT) Creation Syntax - answerPython - @dlt.table() SQL - The Select Statement

In SQL, the core logic of your query, containing transformations you make to your data, is contained in the SELECT statement. In Python, data transformations are specified when you configure options for @dlt.table(). Python vs SQL (DLT) Commenting - answerPython - @dlt.table(comment = "Python comment",table_properties = {"quality": "silver"}) SQL - COMMENT "SQL comment" TBLPROPERTIES ("quality" = "silver") This is how you add comments and table properties in Python vs. SQL Query Execution with Delta Lake - answerDLT uses Delta Lake to store all tables, each time a query is executed, we will always return the most recent version of the table. But queries outside of DLT will return snapshot results from DLT tables, regardless of how they were defined. Directories for DLT - answerThe autoloader and checkpoint directories contain data used to manage incremental data processing with Structured Streaming. The system directory captures events associated with the pipeline. Querying Eent Log syntax - answerdisplay(spark.sql(f"SELECT * FROM delta.{DA.paths.storage_location}/system/events")) Event_Types Values - answercluster_resources planning_information flow_progress user_action create_update update_progress Event Log Raw - answerevent_log_path = f"{DA.paths.storage_location}/system/events" Set Latest Update ID (DLT Event Logs) - answerlatest_update_id = spark.sql(""" SELECT origin.update_id FROM event_log_raw WHERE event_type = 'create_update' ORDER BY timestamp DESC LIMIT 1""").first().update_id print(f"Latest Update ID: {latest_update_id}")

Push back into the spark config so that we can use it in a later query.

spark.conf.set('latest_update.id', latest_update_id) Perform Audit Logging (DLT Event Logs) - answerEvents related to running pipelines and editing configurations are captured as user_action.

Auto Loader - answerAuto Loader functionality to provide optimized execution for incrementally loading data from cloud object storage into Delta Lake Auto Loader will automatically detect new data files as they land in the source cloud object storage location, incrementally processing new records without the need to perform expensive scans and recomputing results for infinitely growing datasets. cloud_files() - answerenables Auto Loader to be used natively with SQL Params: The source location, which should be cloud-based object storage The source data format, which is JSON in this case An arbitrarily sized comma-separated list of optional reader options. In this case, we set cloudFiles.inferColumnTypes to true Auto loader syntax example - answerCREATE OR REFRESH STREAMING LIVE TABLE orders_bronze AS SELECT current_timestamp() processing_time, input_file_name() source_file, * FROM cloud_files("${source}/orders", "json", map("cloudFiles.inferColumnTypes", "true")) DLT FROM clause has two constructs - answerThe LIVE keyword is used in place of the schema name to refer to the target schema configured for the current DLT pipeline The STREAM method allows users to declare a streaming data source for SQL queries TBLPROPERTIES - answerset the value silver for the key quality CREATE SILVER DLT TABLE - answerCREATE OR REFRESH STREAMING LIVE TABLE orders_silver (CONSTRAINT valid_date EXPECT (order_timestamp > "2021-01-01") ON VIOLATION FAIL UPDATE) COMMENT "Append only orders with valid timestamps" TBLPROPERTIES ("quality" = "silver") AS SELECT timestamp(order_timestamp) AS order_timestamp, * EXCEPT (order_timestamp, source_file, _rescued_data) FROM STREAM(LIVE.orders_bronze) Difference between Live Tables and Streaming Live Tables - answerLIVE: Always "correct", meaning their contents will match their definition after any update. Return same results as if table had just been defined for first time on all data. Should not be modified by operations external to the DLT Pipeline (you'll either get undefined answers or your change will just be undone). STREAMING LIVE:

Only supports reading from "append-only" streaming sources. Only reads each input batch once, no matter what (even if joined dimensions change, or if the query definition changes, etc). Can perform operations on the table outside the managed DLT Pipeline (append data, perform GDPR, etc). omits the Auto Loader option to infer schema - answerfields will have the correct names but will all be stored as STRING type. Create streaming table from autoloader json file - answerCREATE OR REFRESH STREAMING LIVE TABLE customers_bronze COMMENT "Raw data from customers CDC feed" AS SELECT current_timestamp() processing_time, input_file_name() source_file, * FROM cloud_files("${source}/customers", "json") APPLY CHANGES INTO - answerPerforms incremental/streaming ingestion of CDC data Provides simple syntax to specify one or many fields as the primary key for a table Default assumption is that rows will contain inserts and updates Can optionally apply deletes Automatically orders late-arriving records using user-provided sequencing key Uses a simple syntax for specifying columns to ignore with the EXCEPT keyword Will default to applying changes as Type 1 SCD data is being updated and deleted in table so it breaks - answerthe append-only requirements for streaming live table sources Views have the same update guarantees as live tables, but the results of queries - answerare not stored to disk. DLT views are not persisted to the metastore, meaning that they can only be referenced from within the DLT pipeline they are a part of Within a DLT Pipeline, code in any notebook library can reference - answertables and views created in any other notebook library. An approach that can work well is to fix one table at a time, - answerstarting at your earliest dataset and working toward your final @dlt.table - answeradded to any Python function that returns a Spark DataFrame. (NOTE: This includes Koalas DataFrames, ) @dlt.tabledef ():return () DLT Autoloader SYntax for Python - [email protected]

Python - name option for DLT table declaration - answerThe option allows developers to specify the name for the resultant table separate from the function definition that creates the DataFrame the table is defined from. @dlt.table( name = "customers_bronze", comment = "Raw data from customers CDC feed" ) def ingest_customers_cdc(): return ( spark.readStream .format("cloudFiles") .option("cloudFiles.format", "json") .load(f"{source}/customers") .select( F.current_timestamp().alias("processing_time"), F.input_file_name().alias("source_file"), "*" ) ) Python - DLT Validation Constraint syntax - [email protected] @dlt.expect_or_fail("valid_id", "customer_id IS NOT NULL") @dlt.expect_or_drop("valid_operation", "operation IS NOT NULL") @dlt.expect("valid_name", "name IS NOT NULL or operation = 'DELETE'") @dlt.expect("valid_adress", """ (address IS NOT NULL and city IS NOT NULL and state IS NOT NULL and zip_code IS NOT NULL) or operation = "DELETE" """) @dlt.expect_or_drop("valid_email", """ rlike(email, '^([a-zA-Z0-9_\\-\\.]+)@([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,5})$') or operation = "DELETE" """) def customers_bronze_clean(): return ( dlt.read_stream("customers_bronze") ) DLT Apply Changes Python Sytnax - answerdlt.create_target_table( name = "customers_silver")

dlt.apply_changes( target = "customers_silver", source = "customers_bronze_clean", keys = ["customer_id"], sequence_by = F.col("timestamp"), apply_as_deletes = F.expr("operation = 'DELETE'"), except_column_list = ["operation", "source_file", "_rescued_data"]) DLT View Python Sytnax - [email protected] def subscribed_order_emails_v(): return ( dlt.read("orders_silver").filter("notifications = 'Y'").alias("a") .join( dlt.read("customers_silver").alias("b"), on="customer_id" ).select( "a.customer_id", "a.order_id", "b.email" ) ) When errors can occur with DLT - answerInitializing or Setting Up Stage dlt.read_stream() - answerThe dlt.read_stream() method is meant only for use if you're using Delta Live Tables (DLT) to create your ETL/ELT pipeline. A data engineer has built and deployed a DLT pipeline. They want to perform an update that writes a batch of data to the output directory. - answerWith each triggered update, all newly arriving data will be processed through their pipeline. Metrics will always be reported for the current run. Workflows - answeris a fully-managed cloud based general purpose task orchestration service for the entire Lakehouse. Workflows is a service for data engineers, data scientists and analysts to build reliable data, analytics and AI workflows on any cloud. Databricks Workflows has two main task orchestration services; - answerWorkflow Jobs (Workflows): Workflows for every job.

Delta Live Tables (DLT): Automated data pipelines for Delta Lake DLT pipeline can be a task in a Workflow

Databricks Unity Catalog - answerUnify governance across clouds Fine-grained governance for data lakes across clouds - based on open standard ANSI SQL Unify data and AI assets Centrally share, audit, secure and manage all data types with one simple interface. Unify existing catalogs Works in concert with existing data, storage, and catalogs - no hard migration require Metastore - answerControl Plane + Cloud Storage Unity Catalog three-level - answernamespace Table Types - answer-External (unmanged) -Managed Lowest level of UC hierarchy - answer-Table -View -Function Level under metastore in UC Heirarchy - answer-Storage Credentials -External Location -Catalog -Share -Recipient UC Architecture - answerUser/group management | Metastore | Access controls ^ | Workspace 1 || Workspace 2 Cloud Administrator - answerAdministers underlying cloud resources • Storage accounts/buckets • IAM role/service principals/Managed Identities Identity Administrator - answerAdministers underlying identity provider service (when in use) • Identity provider provisions users and groups into the account • Avoids need to manually create and manage identities Account Administrator - answerAdministers the account • Creates, deletes, and assigns metastores to workspaces • Creates, deletes, and assigns users and groups to workspaces • Integrates account with an identity provider • Full access to all data objects

Metastore Administrator - answerAdministers a metastore • Creates and drops catalogs and other data objects • Grants privileges on data objects • Changes ownerships of data objects • Designated by an account administrator Data Owner - answerOwns data objects they created • Creates nested objects • Grants privileges to others on owned objects • Changes ownership of owned objects Workspace Administrator - answerAdministers a workspace • Manages permissions on workspace assets • Restricts access to cluster creation • Adds or removes users • Elevates users permissions • Grant privileges to others • Change job ownership Workspace Identities - answer-User -Account admin -Service Principal (w/ or w/o admin privelages) -Groups (w/ or w/o nesting groups) UC Security Model - Query Lifecycle - answer1) Send Query (Principal)

Check namespaces, grants, metadata (Compute) <--> (Cloud Storage)
Assume IAM Role or Service Principal
Return short lived token and signed url
Return data from URL with short lived token
Return data
Enforce policies (Audit Logs) Principal -> COmpute -> Cloud storage service principal - answerdifferent type of individual identity intended for use with automation tools and running jobs. These are assigned a name by an account administror, though they are identified through a globally unique identifier (GUID) that is dynamically generated when the identity is created. Service principals authenticate to a workspace using a token and access functionality through APIs. Single user clusters - what's allowed - answerMultiple language support, not shareable Support all features (init scripts, mounts, dbfs scripts) -RDD API -DBFS Fuse mounts -Init scripts and libraries -Machine learning User isolation Clusters - answerSQL Python only

is_account_group_member() - answerreturns TRUE if the user querying the view is a member of the specified group is_member() - answerplease refrain from using the legacy function is_member(), which references workspace-level groups. This is not good practice in the context of Unity Catalog. Dynamic views create statement - answerCREATE OR REPLACE VIEW agg_heartrate AS SELECT CASE WHEN is_account_group_member('analysts') THEN 'REDACTED' ELSE mrn END AS mrn, CASE WHEN is_account_group_member('analysts') THEN 'REDACTED' ELSE name END AS name, MEAN(heartrate) avg_heartrate, DATE_TRUNC("DD", time) date FROM heartrate_device GROUP BY mrn, name, DATE_TRUNC("DD", time); -- Re-issue the grant -- GRANT SELECT ON VIEW agg_heartrate to analysts CREATE OR REPLACE VIEW agg_heartrate AS SELECT mrn, time,x device_id, heartrate FROM heartrate_device WHERE CASE WHEN is_account_group_member('analysts') THEN device_id < 30 ELSE TRUE END; -- Re-issue the grant -- GRANT SELECT ON VIEW agg_heartrate to analysts mask() - answeruser-defined function that we created earlier to mask the mrn column for our analaysts Dynamic views create statement (masking) - answerCREATE OR REPLACE VIEW agg_heartrate AS

SELECT

CASE WHEN

is_account_group_member('analysts') THEN mask(mrn) ELSE mrn END AS mrn, time, device_id, heartrate FROM heartrate_device WHERE CASE WHEN is_account_group_member('analysts') THEN device_id < 30 ELSE TRUE END; -- Re-issue the grant -- GRANT SELECT ON VIEW agg_heartrate to analysts REVOKE example - answerREVOKE EXECUTE ON FUNCTION mask FROM analysts GRANT USAGE on SCHEMA example - answerGRANT USAGE ON SCHEMA patient_gold TO analysts storage credential - answerestablish a credential to access the cloud file store where the table data will live in data explorer external location - answerUnity Catalog also provides a wrapper known as an external location that additionally specifies a path within the storage container. Using external locations for access control is preferred approach, since it gives us control at the file path level rather than at the level of the storage container itself. CREATE OR REPLACE TABLE silver_external LOCATION '' AS SELECT * FROM silver_managed Grant access to files - answerREAD FILES: ability to directly read files stored in this location WRITE FILES: ability to directly write files stored in this location CREATE TABLE: ability to create a table based on files stored in this location UC Region constraints - answerone metastore per region delta sharing can get around this but read-only

-Job Scheduling Account attached to control plane - answerDatabricks Cloud Account /What components are in the data plane and what account is attached - answer-Data Sources -DBFS Customer Cloud Account All-purpose Clusters retain stats - answerConfiguration information retained for up to 70 clusters for up to 30 days Job Clusters retain stats - answerConfiguration information retained for up to 30 most recently terminated clusters Which of the following pieces of information must be configured in the user settings of a workspace to integrate a Git service provider with a Databricks Repo? Select two responses. - answer-Username for Git service provider account -Personal Access Token Which of the following describes the key benefits of why migrating from a data lake to a data lakehouse will be beneficial for the data engineering team? - answerData lakehouses are able to support data quality solutions like ACID-compliant transactions. Data lakehouses are able to improve query performance by managing metadata and utilizing advanced data partitioning techniques. Which of the following operations are supported by Databricks Repos? Select two responses. - answerClone Pull Which of the following statements about querying tables defined against external sources is true? Select one response. - answerWhen defining tables or queries against external data sources, the performance guarantees associated with Delta Lake and Lakehouse cannot be guaranteed. A data engineer needs to query a JSON file whose location is represented by the variable path. - answerSELECT * FROM json.${path}; Accessing data in event_type StructType - answerThey can use. syntax to access id in event_type.

A data engineer is registering a table in Databricks using the table users from an external SQL database. - answerCREATE TABLE users_jdbcUSING JDBCOPTIONS ( url = "jdbc:sqlite:${DA.paths.ecommerce_db}", dbtable = "users") A data engineer wants to extract lines as raw strings from a text file. - answerSELECT * FROM text.${dbfs:/mnt/datasets}/001.txt; The data engineer can rotate the data from long to wide format using - answerthe .pivot() function. accessing values in json - answeruse : syntax to access date in events_df. What can be eliminated from: @udf("string") def first_letter_function(email: str) -> str: return email[0] first_letter_udf = spark.udf.register("sql_udf", first_letter_function) - answerThey can eliminate the statement that registers the function. CREATE TABLE events (user_id string, event_name string, item_id string, events struct); return a new DataFrame that shows the unique collection of the columns event_name and items. - answerarray_distinct(flatten(collect_set("events.event_id"))).alias("event_history") Refresh the table emails syntax - answerREFRESH TABLE emails; Query a Single File / Query a Directory of Files - answerSELECT * FROM file_format./path/to/file SELECT * FROM json.${DA.paths.kafka_events} Create References to Files - answerCREATE OR REPLACE VIEW event_view AS SELECT * FROM json.${DA.paths.kafka_events} Create Temporary References to Files - answerCREATE OR REPLACE TEMP VIEW events_temp_view AS SELECT * FROM json.${DA.paths.kafka_events} Extract the Raw Bytes and Metadata of a File - answerSELECT * FROM binaryFile.$ {DA.paths.kafka_events}

Databricks Associate Questions with Solutions, Exams of Advanced Education

Related documents

Partial preview of the text

Download Databricks Associate Questions with Solutions and more Exams Advanced Education in PDF only on Docsity!

Databricks Associate

Push back into the spark config so that we can use it in a later query.

SELECT

CASE WHEN