DATABRICKS - DATA ENGINEER ASSOCIATE EXAM 2 2024/2025, Exams of Nursing

DATABRICKS - DATA ENGINEER ASSOCIATE EXAM 2 2024/2025

Typology: Exams

2024/2025

Available from 09/03/2024

answerhub
answerhub 🇺🇸

4

(16)

5.4K documents

1 / 43

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
DATABRICKS - DATA ENGINEER
ASSOCIATE EXAM 2 2024/2025
What is the best way to describe a data lakehouse compared to a data warehouse?
a. A data lakehouse provides a relational system of data management
b. A data lakehouse captures snapshots of data for version control purposes.
c. A data lakehouse couples storage and compute for complete control.
d. A data lakehouse utilizes proprietary storage formats for data.
e. A data lakehouse enables both batch and streaming analytics. - Precise Answer ✔✔e. A data
lakehouse enables both batch and streaming analytics.
Explanation
Answer is A data lakehouse enables both batch and streaming analytics.
A lakehouse has the following key features:
Transaction support: In an enterprise lakehouse many data pipelines will often be reading and writing
data concurrently. Support for ACID transactions ensures consistency as multiple parties concurrently
read or write data, typically using SQL.
Schema enforcement and governance: The Lakehouse should have a way to support schema
enforcement and evolution, supporting DW schema architectures such as star/snowflake-schemas. The
system should be able to reason about data integrity, and it should have robust governance and auditing
mechanisms.
BI support: Lakehouses enable using BI tools directly on the source data. This reduces staleness and
improves recency, reduces latency, and lowers the cost of having to operationalize two copies of the
data in both a data lake and a warehouse.
Storage is decoupled from compute: In practice this means storage and compute use separate clusters,
thus these systems are able to scale to many more concurrent users and larger data sizes. Some modern
data warehouses also have this property.
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f
pf20
pf21
pf22
pf23
pf24
pf25
pf26
pf27
pf28
pf29
pf2a
pf2b

Partial preview of the text

Download DATABRICKS - DATA ENGINEER ASSOCIATE EXAM 2 2024/2025 and more Exams Nursing in PDF only on Docsity!

DATABRICKS - DATA ENGINEER

ASSOCIATE EXAM 2 2024/

What is the best way to describe a data lakehouse compared to a data warehouse? a. A data lakehouse provides a relational system of data management b. A data lakehouse captures snapshots of data for version control purposes. c. A data lakehouse couples storage and compute for complete control. d. A data lakehouse utilizes proprietary storage formats for data. e. A data lakehouse enables both batch and streaming analytics. - Precise Answer ✔✔e. A data lakehouse enables both batch and streaming analytics. Explanation Answer is A data lakehouse enables both batch and streaming analytics. A lakehouse has the following key features: Transaction support: In an enterprise lakehouse many data pipelines will often be reading and writing data concurrently. Support for ACID transactions ensures consistency as multiple parties concurrently read or write data, typically using SQL. Schema enforcement and governance: The Lakehouse should have a way to support schema enforcement and evolution, supporting DW schema architectures such as star/snowflake-schemas. The system should be able to reason about data integrity, and it should have robust governance and auditing mechanisms. BI support: Lakehouses enable using BI tools directly on the source data. This reduces staleness and improves recency, reduces latency, and lowers the cost of having to operationalize two copies of the data in both a data lake and a warehouse. Storage is decoupled from compute: In practice this means storage and compute use separate clusters, thus these systems are able to scale to many more concurrent users and larger data sizes. Some modern data warehouses also have this property.

Openness: The storage formats they use are open and standardized, such as Parquet, and they provide an API so a variety of tools and engines, including machine learning and Python/R libraries, can efficiently access the data directly. Support for diverse data types ranging from unstructured to structured data: The lakehouse can be used to store, refine, analyze, and access data types needed for many new data applications, including images, video, audio, semi-structured data, and text. Support for diverse workloads: including data science, machine learning, and SQL and analytics. Multiple tools might be needed to s You are designing an analytical to store structured data from your e-commerce platform and unstructured data from website traffic and app store, how would you approach where you store this data? a. Use traditional data warehouse for structured data and use data lakehouse for unstructured data. b. Data lakehouse can only store unstructured data but cannot enforce a schema c. Data lakehouse can store structured and unstructured data and can enforce schema d. Traditional data warehouses are good for storing structured data and enforcing schema - Precise Answer ✔✔c. Data lakehouse can store structured and unstructured data and can enforce schema explanation The answer is, Data lakehouse can store structured and unstructured data and can enforce schema What Is a Lakehouse? - The Databricks Blog You are currently working on a production job failure with a job set up in job clusters due to a data issue, what cluster do you need to start to investigate and analyze the data? a. A Job cluster can be used to analyze the problem

explanation Answer is Databricks Repos can commit or push code changes to trigger a CI/CD process See below diagram to understand the role Databricks Repos and Git provider plays when building a CI/CD workdlow. All the steps highlighted in yellow can be done Databricks Repo, all the steps highlighted in Gray are done in a git provider like Github or Azure Devops. You noticed that colleague is manually copying the notebook with _bkp to store the previous versions, which of the following feature would you recommend instead. a. Databricks notebooks support change tracking and versioning b. Databricks notebooks should be copied to a local machine and setup source control locally to version the notebooks c. Databricks notebooks can be exported into dbc archive files and stored in data lake d. Databricks notebook can be exported as HTML and imported at a later time - Precise Answer ✔✔a. Databricks notebooks support change tracking and versioning Explanation Answer is Databricks notebooks support automatic change tracking and versioning. When you are editing the notebook on the right side check version history to view all the changes, every change you are making is captured and saved. Newly joined data analyst requested read-only access to tables, assuming you are owner/admin which section of Databricks platform is going to facilitate granting select access to the user ____.

a. Admin console b. User settings c. Data explorer d. Azure Databricks control pane IAM e. Azure RBAC - Precise Answer ✔✔c. Data explorer Explanation Answer is Data Explorer https://docs.databricks.com/sql/user/data/index.html Data explorer lets you easily explore and manage permissions on databases and tables. Users can view schema details, preview sample data, and see table details and properties. Administrators can view and change owners, and admins and data object owners can grant and revoke permissions. To open data explorer, click Data in the sidebar. How does a Delta Lake differ from a traditional data lake? a. Delta lake is Data warehouse service on top of data lake that can provide reliability, security, and performance b. Delta lake is a caching layer on top of data lake that can provide reliability, security, and performance c. Delta lake is an open storage format like parquet with additional capabilities that can provide reliability, security, and performance d. Delta lake is an open storage format designed to replace flat files with additional capabilities that can provide reliability, security, and performance

b. CREATE TABLE transactions ( transactionId int, transactionDate timestamp, unitsSold int) FORMAT DELTA c. CREATE TABLE transactions ( transactionId int, transactionDate timestamp, unitsSold int) d. CREATE TABLE USING DELTA transactions ( transactionId int, transactionDate timestamp, unitsSold int) e. CREATE TABLE transactions ( transactionId int, transactionDate timestamp, unitsSold int) LOCATION DELTA - Precise Answer ✔✔c. CREATE TABLE transactions ( transactionId int, transactionDate timestamp, unitsSold int)

Explanation Answer is CREATE TABLE transactions ( transactionId int, transactionDate timestamp, unitsSold int) When creating a table in Databricks by default the table is stored in DELTA format. Which of the following is a correct statement on how the data is organized in the storage when when managing a DELTA table? a. All of the data is broken down into one or many parquet files, log files are broken down into one or many JSON files, and each transaction creates a new data file(s) and log file. b. All of the data and log are stored in a single parquet file c. All of the data is broken down into one or many parquet files, but the log file is stored as a single json file, and every transaction creates a new data file(s) and log file gets appended. d. All of the data is broken down into one or many parquet files, log file is removed once the transaction is committed. e. All of the data is stored into one parquet file, log files are broken down into one or many json files. - Precise Answer ✔✔a. All of the data is broken down into one or many parquet files, log files are broken down into one or many JSON files, and each transaction creates a new data file(s) and log file. Explanation Answer is

Explanation The answer is AUTO LOADER, Use Auto Loader instead of the COPY INTO SQL command when: You want to load data from a file location that contains files in the order of millions or higher. Auto Loader can discover files more efficiently than the COPY INTO SQL command and can split file processing into multiple batches. COPY INTO only directory listing but AUTO LOADER supports File notification method where the Auto Loader continues to ingest files as they arrive in cloud object storage leveraging cloud provider(Queues and triggers) and Spark's structured streaming. Your data schema evolves frequently. Auto Loader provides better support for schema inference and evolution. See Configuring schema inference and evolution in Auto Loader. At the end of the inventory process, a file gets uploaded to the cloud object storage, you are asked to build a process to ingest data which of the following method can be used to ingest the data incrementally, schema of the file is expected to change overtime ingestion process should be able to handle these changes automatically. Below is the auto loader to command to load the data, fill in the blanks for successful execution of below code. spark.readStream .format("cloudfiles") .option("_______","csv) .option("_______", 'dbfs:/location/checkpoint/') .load(data_source) .writeStream .option("_______",' dbfs:/location/checkpoint/') .option("_______", "true") .table(table_name)) a. format, checkpointlocation, schemalocation, overwrite

b. cloudfiles.format, checkpointlocation, cloudfiles.schemalocation, overwrite c. cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, mergeSchema d. cloudfiles.form - Precise Answer ✔✔c. cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, mergeSchema xplanation The answer is cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, mergeSchema. Here is the end to end syntax of streaming ELT, below link contains complete options Auto Loader options | Databricks on AWS spark.readStream .format("cloudfiles") # Returns a stream data source, reads data as it arrives based on the trigger. .option("cloudfiles.format","csv") # Format of the incoming files .option("cloudfiles.schemalocation", "dbfs:/location/checkpoint/") The location to store the inferred schema and subsequent changes .load(data_source) .writeStream .option("checkpointlocation","dbfs:/location/checkpoint/") # The location of the stream's checkpoint .option("mergeSchema", "true") # Infer the schema across multiple files and to merge the schema of each file. Enabled by default for Auto Loader when inferring the schema. .table(table_name)) # target table What is the purpose of the bronze layer in a Multi-hop architecture?

e. Optimized query performance for business-critical data - Precise Answer ✔✔c. A schema is enforced, with data quality checks. Explanation The answer is, A schema is enforced, with data quality checks. Medallion Architecture - Databricks Silver Layer: Reduces data storage complexity, latency, and redundency Optimizes ETL throughput and analytic query performance Preserves grain of original data (without aggregation) Eliminates duplicate records production schema enforced Data quality checks, quarantine corrupt data Exam focus: Please review the below image and understand the role of each layer(bronze, silver, gold) in medallion architecture, you will see varying questions targeting each layer and its purpose. Sorry I had to add the watermark some people in Udemy are copying my content. What is the purpose of a gold layer in Multi-hop architecture? a. Optimizes ETL throughput and analytic query performance b. Eliminate duplicate records c. Preserves grain of original data, without any aggregations d. Data quality checks and schema enforcement e. Powers ML applications, reporting, dashboards and adhoc reports. - Precise Answer ✔✔e. Powers ML applications, reporting, dashboards and adhoc reports. Explanation

The answer is Powers ML applications, reporting, dashboards and adhoc reports. Review the below link for more info, Medallion Architecture - Databricks Gold Layer: Powers Ml applications, reporting, dashboards, ad hoc analytics Refined views of data, typically with aggregations Reduces strain on production systems Optimizes query performance for business-critical data Exam focus: Please review the below image and understand the role of each layer(bronze, silver, gold) in medallion architecture, you will see varying questions targeting each layer and its purpose. Sorry I had to add the watermark some people in Udemy are copying my content. You are currently asked to work on building a data pipeline, you have noticed that you are currently working on a very large scale ETL many data dependencies, which of the following tools can be used to address this problem? a. AUTO LOADER b. JOBS and TASKS c. SQL Endpoints d. DELTA LIVE TABLES e. STRUCTURED STREAMING with MULTI HOP - Precise Answer ✔✔d. DELTA LIVE TABLES xplanation The answer is, DELTA LIVE TABLES DLT simplifies data dependencies by building DAG-based joins between live tables. Here is a view of how the dag looks with data dependencies without additional meta data,

Impossible to trace data lineage Difficult pipeline operations Poor observability at granular data level Error handling and recovery is laborious How do you create a delta live tables pipeline and deploy using DLT UI? a. Within the Workspace UI, click on Workflows, select Delta Live tables and create a pipeline and select the notebook with DLT code. b. Under Cluster UI, select SPARK UI and select Structured Streaming and click create pipeline and select the notebook with DLT code. c. There is no UI, you can only setup DELTA LIVE TABLES using Python and SQL API and select the notebook with DLT code. d. Use VS Code and download DBX plugin, once the plugin is loaded you can build DLT pipelines and select the notebook with DLT code. e. Within the Workspace UI, click on SQL Endpoint, select Delta Live tables and create pipeline and select the notebook with DLT code. - Precise Answer ✔✔a. Within the Workspace UI, click on Workflows, select Delta Live tables and create a pipeline and select the notebook with DLT code. Explanation The answer is, Within the Workspace UI, click on Workflows, select Delta Live tables and create a pipeline and select the notebook with DLT code. https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-quickstart.html You are noticing job cluster is taking 6 to 8 mins to start which is delaying your job to finish on time, what steps you can take to reduce the amount of time cluster startup time?

a. Setup a second job ahead of first job to start the cluster, so the cluster is ready with resources when the job starts b. Use All purpose cluster instead to reduce cluster start up time c. Reduce the size of the cluster, smaller the cluster size shorter it takes to start the cluster d. Use cluster pools to reduce the startup time of the jobs e. Use SQL endpoints to reduce the startup time - Precise Answer ✔✔d. Use cluster pools to reduce the startup time of the jobs Explanation The answer is, Use cluster pools to reduce the startup time of the jobs. Cluster pools allow us to reserve VM's ahead of time, when a new job cluster is created VM are grabbed from the pool. Note: when the VM's are waiting to be used by the cluster only cost incurred is Azure. Databricks run time cost is only billed once VM is allocated to a cluster. Here is a demo of how to setup and follow some best practices, https://www.youtube.com/watch?v=FVtITxOabxg&ab_channel=DatabricksAcademy Data engineering team has a job currently setup to run a task load data into a reporting table every day at 8: 00 AM takes about 20 mins, Operations teams are planning to use that data to run a second job, so they access latest complete set of data. What is the best to way to orchestrate this job setup? a. Add Operation reporting task in the same job and set the Data Engineering task to depend on Operations reporting task b. Setup a second job to run at 8:20 AM in the same workspace

The answer is, Modify the task, to include time out to kill the job if it runs more than 15 mins. https://docs.microsoft.com/en-us/azure/databricks/data-engineering/jobs/jobs#timeout Which of the following programming languages can be used to build a Databricks SQL dashboard? a. Python b. Scala c. SQL d. R e. All of the above - Precise Answer ✔✔c. SQL Here is the example query --- Get order summary create or replace table orders_summary as select product_id, sum(order_count) order_count from ( select product_id,order_count from orders_instore union all select product_id,order_count from orders_online ) group by product_id -- get supply summary create or repalce tabe supply_summary as select product_id, sum(supply_count) supply_count from supply group by product_id

-- get on hand based on orders summary and supply summary with stock_cte as ( select nvl(s.product_id,o.product_id) as product_id, nvl(supply_count,0) - nvl(order_count,0) as on_hand from supply_summary s full outer join orders_summary o on s.product_id = o.product_id ) select * from stock_cte where on_hand = 0 The data analyst team had put together queries that identify items that are out of stock based on orders and replenishment but when they run all together for final output the team noticed it ta - Precise Answer ✔✔c. Increase the cluster size of the SQL endpoint. xplanation The answer is to increase the cluster size of the SQL Endpoint, here queries are running sequentially and since the single query can not span more than one cluster adding more clusters won't improve the query but rather increasing the cluster size will improve performance so it can use additional compute in a warehouse. In the exam please note that additional context will not be given instead you have to look for cue words or need to understand if the queries are running sequentially or concurrently. if the queries are running sequentially then scale up(more nodes) if the queries are running concurrently (more users) then scale out(more clusters). Below is the snippet from Azure, as you can see by increasing the cluster size you are able to add more worker nodes.