



















Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Databricks Associate Questions with Solutions
Typology: Exams
1 / 27
This page cannot be seen from the preview
Don't miss anything!




















Bronze - answer Typically just a raw copy of ingested data Replaces traditional data lake Provides efficient storage and querying of full, unprocessed history of data Silver - answer Filtered, Cleaned, Augmented Reduces data storage complexity, latency, and redundancy Optimizes ETL throughput and analytic query performance Preserves grain of original data (without aggregations) Eliminates duplicate records Production schema enforced Data quality checks, corrupt data quarantined Gold - answerPowers ML applications, reporting, dashboards, ad hoc analytics Refined views of data, typically with aggregations Reduces strain on production systems Optimizes query performance for business-critical data What is a Live Table? - answerLive Tables are materialized views for the lakehouse Defined by a SQL query • Created and kept up-to-date by a pipeline Live Table Create - answerCREATE OR REPLACE/REFRSH LIVE TABLE report AS SELECT sum(profit) FROM prod.sales What is a Streaming Live Table? - answerStreaming Live tables compute results over append-only streams such as Kafka, Kinesis, or Auto Loader (files on cloud storage)
Declare LIVE Dependencies (DLT) - answer• Dependencies owned by other producers are just read from the catalog or spark data source as normal.
Which of the approache can the data engineer take to accomplish this? Select one response. - answerThe data engineer can query events of type user_action from the event log. A data engineer wants to query metrics on the latest update made to their pipeline. They need to be able to see the event type and timestamp for each update. - answerThe data engineer can query the update ID from the events log where the event type is create_update. A data engineer wants to query metrics on the latest update made to their pipeline. The pipeline has multiple data sources. Despite the input data sources having low data retention, the data engineer needs to retain the results of the query indefinitely. - answerStreaming live table; streaming live tables can preserve data indefinitely. ) Which of the following correctly describes how to access contents of the table directory?
In SQL, the core logic of your query, containing transformations you make to your data, is contained in the SELECT statement. In Python, data transformations are specified when you configure options for @dlt.table(). Python vs SQL (DLT) Commenting - answerPython - @dlt.table(comment = "Python comment",table_properties = {"quality": "silver"}) SQL - COMMENT "SQL comment" TBLPROPERTIES ("quality" = "silver") This is how you add comments and table properties in Python vs. SQL Query Execution with Delta Lake - answerDLT uses Delta Lake to store all tables, each time a query is executed, we will always return the most recent version of the table. But queries outside of DLT will return snapshot results from DLT tables, regardless of how they were defined. Directories for DLT - answerThe autoloader and checkpoint directories contain data used to manage incremental data processing with Structured Streaming. The system directory captures events associated with the pipeline. Querying Eent Log syntax - answerdisplay(spark.sql(f"SELECT * FROM delta.{DA.paths.storage_location}/system/events")) Event_Types Values - answercluster_resources planning_information flow_progress user_action create_update update_progress Event Log Raw - answerevent_log_path = f"{DA.paths.storage_location}/system/events" Set Latest Update ID (DLT Event Logs) - answerlatest_update_id = spark.sql(""" SELECT origin.update_id FROM event_log_raw WHERE event_type = 'create_update' ORDER BY timestamp DESC LIMIT 1""").first().update_id print(f"Latest Update ID: {latest_update_id}")
spark.conf.set('latest_update.id', latest_update_id) Perform Audit Logging (DLT Event Logs) - answerEvents related to running pipelines and editing configurations are captured as user_action.
Auto Loader - answerAuto Loader functionality to provide optimized execution for incrementally loading data from cloud object storage into Delta Lake Auto Loader will automatically detect new data files as they land in the source cloud object storage location, incrementally processing new records without the need to perform expensive scans and recomputing results for infinitely growing datasets. cloud_files() - answerenables Auto Loader to be used natively with SQL Params: The source location, which should be cloud-based object storage The source data format, which is JSON in this case An arbitrarily sized comma-separated list of optional reader options. In this case, we set cloudFiles.inferColumnTypes to true Auto loader syntax example - answerCREATE OR REFRESH STREAMING LIVE TABLE orders_bronze AS SELECT current_timestamp() processing_time, input_file_name() source_file, * FROM cloud_files("${source}/orders", "json", map("cloudFiles.inferColumnTypes", "true")) DLT FROM clause has two constructs - answerThe LIVE keyword is used in place of the schema name to refer to the target schema configured for the current DLT pipeline The STREAM method allows users to declare a streaming data source for SQL queries TBLPROPERTIES - answerset the value silver for the key quality CREATE SILVER DLT TABLE - answerCREATE OR REFRESH STREAMING LIVE TABLE orders_silver (CONSTRAINT valid_date EXPECT (order_timestamp > "2021-01-01") ON VIOLATION FAIL UPDATE) COMMENT "Append only orders with valid timestamps" TBLPROPERTIES ("quality" = "silver") AS SELECT timestamp(order_timestamp) AS order_timestamp, * EXCEPT (order_timestamp, source_file, _rescued_data) FROM STREAM(LIVE.orders_bronze) Difference between Live Tables and Streaming Live Tables - answerLIVE: Always "correct", meaning their contents will match their definition after any update. Return same results as if table had just been defined for first time on all data. Should not be modified by operations external to the DLT Pipeline (you'll either get undefined answers or your change will just be undone). STREAMING LIVE:
Only supports reading from "append-only" streaming sources. Only reads each input batch once, no matter what (even if joined dimensions change, or if the query definition changes, etc). Can perform operations on the table outside the managed DLT Pipeline (append data, perform GDPR, etc). omits the Auto Loader option to infer schema - answerfields will have the correct names but will all be stored as STRING type. Create streaming table from autoloader json file - answerCREATE OR REFRESH STREAMING LIVE TABLE customers_bronze COMMENT "Raw data from customers CDC feed" AS SELECT current_timestamp() processing_time, input_file_name() source_file, * FROM cloud_files("${source}/customers", "json") APPLY CHANGES INTO - answerPerforms incremental/streaming ingestion of CDC data Provides simple syntax to specify one or many fields as the primary key for a table Default assumption is that rows will contain inserts and updates Can optionally apply deletes Automatically orders late-arriving records using user-provided sequencing key Uses a simple syntax for specifying columns to ignore with the EXCEPT keyword Will default to applying changes as Type 1 SCD data is being updated and deleted in table so it breaks - answerthe append-only requirements for streaming live table sources Views have the same update guarantees as live tables, but the results of queries - answerare not stored to disk. DLT views are not persisted to the metastore, meaning that they can only be referenced from within the DLT pipeline they are a part of Within a DLT Pipeline, code in any notebook library can reference - answertables and views created in any other notebook library. An approach that can work well is to fix one table at a time, - answerstarting at your earliest dataset and working toward your final @dlt.table - answeradded to any Python function that returns a Spark DataFrame. (NOTE: This includes Koalas DataFrames, ) @dlt.tabledef ():return () DLT Autoloader SYntax for Python - [email protected]
Python - name option for DLT table declaration - answerThe option allows developers to specify the name for the resultant table separate from the function definition that creates the DataFrame the table is defined from. @dlt.table( name = "customers_bronze", comment = "Raw data from customers CDC feed" ) def ingest_customers_cdc(): return ( spark.readStream .format("cloudFiles") .option("cloudFiles.format", "json") .load(f"{source}/customers") .select( F.current_timestamp().alias("processing_time"), F.input_file_name().alias("source_file"), "*" ) ) Python - DLT Validation Constraint syntax - [email protected] @dlt.expect_or_fail("valid_id", "customer_id IS NOT NULL") @dlt.expect_or_drop("valid_operation", "operation IS NOT NULL") @dlt.expect("valid_name", "name IS NOT NULL or operation = 'DELETE'") @dlt.expect("valid_adress", """ (address IS NOT NULL and city IS NOT NULL and state IS NOT NULL and zip_code IS NOT NULL) or operation = "DELETE" """) @dlt.expect_or_drop("valid_email", """ rlike(email, '^([a-zA-Z0-9_\\-\\.]+)@([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,5})$') or operation = "DELETE" """) def customers_bronze_clean(): return ( dlt.read_stream("customers_bronze") ) DLT Apply Changes Python Sytnax - answerdlt.create_target_table( name = "customers_silver")
dlt.apply_changes( target = "customers_silver", source = "customers_bronze_clean", keys = ["customer_id"], sequence_by = F.col("timestamp"), apply_as_deletes = F.expr("operation = 'DELETE'"), except_column_list = ["operation", "source_file", "_rescued_data"]) DLT View Python Sytnax - [email protected] def subscribed_order_emails_v(): return ( dlt.read("orders_silver").filter("notifications = 'Y'").alias("a") .join( dlt.read("customers_silver").alias("b"), on="customer_id" ).select( "a.customer_id", "a.order_id", "b.email" ) ) When errors can occur with DLT - answerInitializing or Setting Up Stage dlt.read_stream() - answerThe dlt.read_stream() method is meant only for use if you're using Delta Live Tables (DLT) to create your ETL/ELT pipeline. A data engineer has built and deployed a DLT pipeline. They want to perform an update that writes a batch of data to the output directory. - answerWith each triggered update, all newly arriving data will be processed through their pipeline. Metrics will always be reported for the current run. Workflows - answeris a fully-managed cloud based general purpose task orchestration service for the entire Lakehouse. Workflows is a service for data engineers, data scientists and analysts to build reliable data, analytics and AI workflows on any cloud. Databricks Workflows has two main task orchestration services; - answerWorkflow Jobs (Workflows): Workflows for every job.
Databricks Unity Catalog - answerUnify governance across clouds Fine-grained governance for data lakes across clouds - based on open standard ANSI SQL Unify data and AI assets Centrally share, audit, secure and manage all data types with one simple interface. Unify existing catalogs Works in concert with existing data, storage, and catalogs - no hard migration require Metastore - answerControl Plane + Cloud Storage Unity Catalog three-level - answernamespace Table Types - answer-External (unmanged) -Managed Lowest level of UC hierarchy - answer-Table -View -Function Level under metastore in UC Heirarchy - answer-Storage Credentials -External Location -Catalog -Share -Recipient UC Architecture - answerUser/group management | Metastore | Access controls ^ | Workspace 1 || Workspace 2 Cloud Administrator - answerAdministers underlying cloud resources • Storage accounts/buckets • IAM role/service principals/Managed Identities Identity Administrator - answerAdministers underlying identity provider service (when in use) • Identity provider provisions users and groups into the account • Avoids need to manually create and manage identities Account Administrator - answerAdministers the account • Creates, deletes, and assigns metastores to workspaces • Creates, deletes, and assigns users and groups to workspaces • Integrates account with an identity provider • Full access to all data objects
Metastore Administrator - answerAdministers a metastore • Creates and drops catalogs and other data objects • Grants privileges on data objects • Changes ownerships of data objects • Designated by an account administrator Data Owner - answerOwns data objects they created • Creates nested objects • Grants privileges to others on owned objects • Changes ownership of owned objects Workspace Administrator - answerAdministers a workspace • Manages permissions on workspace assets • Restricts access to cluster creation • Adds or removes users • Elevates users permissions • Grant privileges to others • Change job ownership Workspace Identities - answer-User -Account admin -Service Principal (w/ or w/o admin privelages) -Groups (w/ or w/o nesting groups) UC Security Model - Query Lifecycle - answer1) Send Query (Principal)
is_account_group_member() - answerreturns TRUE if the user querying the view is a member of the specified group is_member() - answerplease refrain from using the legacy function is_member(), which references workspace-level groups. This is not good practice in the context of Unity Catalog. Dynamic views create statement - answerCREATE OR REPLACE VIEW agg_heartrate AS SELECT CASE WHEN is_account_group_member('analysts') THEN 'REDACTED' ELSE mrn END AS mrn, CASE WHEN is_account_group_member('analysts') THEN 'REDACTED' ELSE name END AS name, MEAN(heartrate) avg_heartrate, DATE_TRUNC("DD", time) date FROM heartrate_device GROUP BY mrn, name, DATE_TRUNC("DD", time); -- Re-issue the grant -- GRANT SELECT ON VIEW agg_heartrate to analysts CREATE OR REPLACE VIEW agg_heartrate AS SELECT mrn, time,x device_id, heartrate FROM heartrate_device WHERE CASE WHEN is_account_group_member('analysts') THEN device_id < 30 ELSE TRUE END; -- Re-issue the grant -- GRANT SELECT ON VIEW agg_heartrate to analysts mask() - answeruser-defined function that we created earlier to mask the mrn column for our analaysts Dynamic views create statement (masking) - answerCREATE OR REPLACE VIEW agg_heartrate AS
is_account_group_member('analysts') THEN mask(mrn) ELSE mrn END AS mrn, time, device_id, heartrate FROM heartrate_device WHERE CASE WHEN is_account_group_member('analysts') THEN device_id < 30 ELSE TRUE END; -- Re-issue the grant -- GRANT SELECT ON VIEW agg_heartrate to analysts REVOKE example - answerREVOKE EXECUTE ON FUNCTION mask FROM analysts GRANT USAGE on SCHEMA example - answerGRANT USAGE ON SCHEMA patient_gold TO analysts storage credential - answerestablish a credential to access the cloud file store where the table data will live in data explorer external location - answerUnity Catalog also provides a wrapper known as an external location that additionally specifies a path within the storage container. Using external locations for access control is preferred approach, since it gives us control at the file path level rather than at the level of the storage container itself. CREATE OR REPLACE TABLE silver_external LOCATION '' AS SELECT * FROM silver_managed Grant access to files - answerREAD FILES: ability to directly read files stored in this location WRITE FILES: ability to directly write files stored in this location CREATE TABLE: ability to create a table based on files stored in this location UC Region constraints - answerone metastore per region delta sharing can get around this but read-only
-Job Scheduling Account attached to control plane - answerDatabricks Cloud Account /What components are in the data plane and what account is attached - answer-Data Sources -DBFS Customer Cloud Account All-purpose Clusters retain stats - answerConfiguration information retained for up to 70 clusters for up to 30 days Job Clusters retain stats - answerConfiguration information retained for up to 30 most recently terminated clusters Which of the following pieces of information must be configured in the user settings of a workspace to integrate a Git service provider with a Databricks Repo? Select two responses. - answer-Username for Git service provider account -Personal Access Token Which of the following describes the key benefits of why migrating from a data lake to a data lakehouse will be beneficial for the data engineering team? - answerData lakehouses are able to support data quality solutions like ACID-compliant transactions. Data lakehouses are able to improve query performance by managing metadata and utilizing advanced data partitioning techniques. Which of the following operations are supported by Databricks Repos? Select two responses. - answerClone Pull Which of the following statements about querying tables defined against external sources is true? Select one response. - answerWhen defining tables or queries against external data sources, the performance guarantees associated with Delta Lake and Lakehouse cannot be guaranteed. A data engineer needs to query a JSON file whose location is represented by the variable path. - answerSELECT * FROM json.${path}; Accessing data in event_type StructType - answerThey can use. syntax to access id in event_type.
A data engineer is registering a table in Databricks using the table users from an external SQL database. - answerCREATE TABLE users_jdbcUSING JDBCOPTIONS ( url = "jdbc:sqlite:${DA.paths.ecommerce_db}", dbtable = "users") A data engineer wants to extract lines as raw strings from a text file. - answerSELECT * FROM text.${dbfs:/mnt/datasets}/001.txt; The data engineer can rotate the data from long to wide format using - answerthe .pivot() function. accessing values in json - answeruse : syntax to access date in events_df. What can be eliminated from: @udf("string") def first_letter_function(email: str) -> str: return email[0] first_letter_udf = spark.udf.register("sql_udf", first_letter_function) - answerThey can eliminate the statement that registers the function. CREATE TABLE events (user_id string, event_name string, item_id string, events struct); return a new DataFrame that shows the unique collection of the columns event_name and items. - answerarray_distinct(flatten(collect_set("events.event_id"))).alias("event_history") Refresh the table emails syntax - answerREFRESH TABLE emails; Query a Single File / Query a Directory of Files - answerSELECT * FROM file_format./path/to/file SELECT * FROM json.${DA.paths.kafka_events} Create References to Files - answerCREATE OR REPLACE VIEW event_view AS SELECT * FROM json.${DA.paths.kafka_events} Create Temporary References to Files - answerCREATE OR REPLACE TEMP VIEW events_temp_view AS SELECT * FROM json.${DA.paths.kafka_events} Extract the Raw Bytes and Metadata of a File - answerSELECT * FROM binaryFile.$ {DA.paths.kafka_events}