DATABRICKS EXAM 2024/2025 WITH 100% ACCURATE SOLUTIONS

What are the two main components of the Databricks Architecture? - Precise Answer ✔✔
1. The control plane consists of the backend services that Databricks manages in its own cloud account.
2. The data plane is where the data is processed.

What are the three different Databricks services? - Precise Answer ✔✔Databricks Data Science and Engineering workspace, Databricks SQL, and Databricks Machine Learning.

What is a cluster? - Precise Answer ✔✔A Databricks cluster is a set of computation resources and configurations on which you run data engineering, data science, and data analytics workloads.

What are the two cluster types? - Precise Answer ✔✔All-purpose clusters analyse data collaboratively using interactive notebooks. You can create all-purpose clusters from the Workspace, through the command-line interface, or with the REST APIs that Databricks provides. You can terminate and restart an all-purpose cluster, and multiple users can share it for collaborative interactive analysis.
Job clusters run automated jobs in an expeditious and robust way. The Databricks job scheduler creates job clusters when you run jobs and terminates them when the associated job is complete. You cannot restart a job cluster. These properties ensure an isolated execution environment for each and every job.

What are the three cluster modes? - Precise Answer ✔✔
1. Standard clusters are ideal for processing large amounts of data with Apache Spark.
2. Single Node clusters are intended for jobs that use small amounts of data or non-distributed workloads such as single-node machine learning libraries.
3. High Concurrency clusters are ideal for groups of users who need to share resources or run ad-hoc jobs. Administrators usually create High Concurrency clusters. Databricks recommends enabling autoscaling for High Concurrency clusters.

What is cluster size and autoscaling? - Precise Answer ✔✔When you create a Databricks cluster, you can either provide a fixed number of workers or provide a minimum and maximum number of workers. When you provide a range, Databricks chooses the appropriate number of workers required to run your job.

What is local disk encryption? - Precise Answer ✔✔Databricks generates an encryption key locally that is unique to each cluster node and is used to encrypt all data stored on local disks.

What are the cluster security modes? - Precise Answer ✔✔
None: no isolation. Does not enforce workspace-local table access control or credential passthrough. Cannot access Unity Catalog data.
Single User: can be used only by a single user. Automated jobs should use single-user clusters.
User Isolation: can be shared by multiple users. Only SQL workloads are supported. Library installation, init scripts, and DBFS FUSE mounts are disabled to enforce strict isolation among the cluster users.

What is a pool? - Precise Answer ✔✔To reduce cluster start time, you can attach a cluster to a predefined pool of idle instances for the driver and worker nodes. The cluster is created using instances in the pools. If a pool does not have sufficient idle resources to create the requested driver or worker nodes, the pool expands by allocating new instances from the instance provider. When an attached cluster is terminated, the instances it used are returned to the pools and can be reused by a different cluster.

What happens when you terminate / restart / delete a cluster? - Precise Answer ✔✔When a cluster terminates (i.e. stops), all resources previously associated with the compute environment are completely removed; cluster configuration settings are maintained.

The Restart button allows us to manually restart our cluster. The Delete button stops our cluster and removes the cluster configuration.

Which magic command do you use to run a notebook from another notebook? - Precise Answer ✔✔%run ../Includes/Classroom-Setup-1.

What are Databricks utilities and how can you use them to list out directories of files from Python cells? - Precise Answer ✔✔Databricks notebooks provide a number of utility commands for configuring and interacting with the environment, for example:
display(dbutils.fs.ls("/databricks-datasets"))

What function should you use when you have tabular data returned by a Python cell? - Precise Answer ✔✔display()

What is the definition of a Delta Lake? - Precise Answer ✔✔Delta Lake is the technology at the heart of the Databricks Lakehouse platform. It is an open-source technology that enables building a data lakehouse on top of existing storage systems. While Delta Lake was initially developed exclusively by Databricks, it has been open sourced for almost 3 years. Delta Lake builds upon standard data formats like Parquet and JSON, is optimized for cloud object storage, and is built for scalable metadata handling.

How does Delta Lake address the data lake pain points to ensure reliable, ready-to-go data? - Precise Answer ✔✔ACID Transactions, Schema Management, Scalable Metadata Handling, Unified Batch and Streaming Data, Data Versioning and Time Travel.

Describe how Delta Lake brings ACID transactions to object storage - Precise Answer ✔✔
Difficult to append data: Delta Lake provides guaranteed consistency for the state at the time an append begins, as well as atomic transactions and high durability. As such, appends will not fail due to conflict, even when writing from many data sources simultaneously.
Difficult to modify existing data: upserts allow us to apply updates and deletes with simple syntax as a single atomic transaction.
Jobs failing midway: changes won't be committed until a job has succeeded. Jobs will either fail or succeed completely.
Real-time operations are not easy: Delta Lake allows atomic micro-batched transaction processing in near real time through a tight integration with Structured Streaming, meaning that you can use both real-time and batch operations in the same set of Delta Lake tables.
Costly to keep historical data versions: the transaction logs used to guarantee atomicity, consistency and isolation allow snapshot queries, which easily enables time travel on your Delta Lake tables.

Is Delta Lake the default for all tables created in Databricks? - Precise Answer ✔✔Yes

What data objects are in the Databricks Lakehouse? - Precise Answer ✔✔
Catalog: a grouping of databases.
Database or schema: a grouping of objects in a catalog. Databases contain tables, views, and functions.

Table: a collection of rows and columns stored as data files in object storage.
View: a saved query, typically against one or more tables or data sources.
Function: saved logic that returns a scalar value or set of rows.

What is a metastore? - Precise Answer ✔✔The metastore contains all of the metadata that defines data objects in the lakehouse. Databricks provides the following metastore options:
Unity Catalog: you can create a metastore to store and share metadata across multiple Databricks workspaces. Unity Catalog is managed at the account level.
Hive metastore: Databricks stores all the metadata for the built-in Hive metastore as a managed service. An instance of the metastore deploys to each cluster and securely accesses metadata from a central repository for each customer workspace.
External metastore: you can also bring your own metastore to Databricks.

What is a catalog? - Precise Answer ✔✔A catalog is the highest abstraction (or coarsest grain) in the Databricks Lakehouse relational model. Every database will be associated with a catalog. Catalogs exist as objects within a metastore. Before the introduction of Unity Catalog, Databricks used a two-tier namespace. Catalogs are the third tier in the Unity Catalog namespacing model: catalog_name.database_name.table_name (see the sketch below).
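As a minimal, hedged illustration of the two-tier versus three-tier addressing described above (the catalog, database, and table names here are hypothetical, not from the source material):

-- Two-tier (pre-Unity Catalog / Hive metastore) reference: database.table
SELECT * FROM my_database.my_table;

-- Three-tier Unity Catalog reference: catalog.database.table
SELECT * FROM my_catalog.my_database.my_table;

-- Alternatively, set the current catalog first (Unity Catalog only)
USE CATALOG my_catalog;
SELECT * FROM my_database.my_table;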

What is a Delta Lake table? - Precise Answer ✔✔A Databricks table is a collection of structured data. A Delta table stores data as a directory of files on cloud object storage and registers table metadata to the metastore within a catalog and schema. As Delta Lake is the default storage provider for tables created in Databricks, all tables created in Databricks are Delta tables by default. Because Delta tables store data in cloud object storage and provide references to data through a metastore, users across an organization can access data using their preferred APIs; on Databricks, this includes SQL, Python, PySpark, Scala, and R.

How do relational objects work in Delta Live Tables? - Precise Answer ✔✔Delta Live Tables uses the concept of a "virtual schema" during logic planning and execution. Delta Live Tables can interact with other databases in your Databricks environment, and Delta Live Tables can publish and persist tables for querying elsewhere by specifying a target database in the pipeline configuration settings. All tables created in Delta Live Tables are Delta tables, and can be declared as either managed or unmanaged tables. While views can be declared in Delta Live Tables, these should be thought of as temporary views scoped to the pipeline. Temporary tables in Delta Live Tables are a unique concept: these tables persist data to storage but do not publish data to the target database. Some operations, such as APPLY CHANGES INTO, will register both a table and a view to the database; the table name will begin with an underscore (_) and the view will have the table name declared as the target of the APPLY CHANGES INTO operation. The view queries the corresponding hidden table to materialize the results. (A short DLT SQL sketch follows this block.)

What is the syntax to create a Delta table? - Precise Answer ✔✔CREATE TABLE IF NOT EXISTS students (id INT, name STRING, value DOUBLE)

What is the syntax to insert data? - Precise Answer ✔✔INSERT INTO students VALUES (4, "Ted", 4.7), (5, "Tiffany", 5.5), (6, "Vini", 6.3);
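Relating to the Delta Live Tables discussion above, a minimal, hedged DLT SQL sketch (the table names are hypothetical, and the exact keywords can differ between DLT releases):

-- Declared inside a Delta Live Tables pipeline, not in an ordinary notebook cell
CREATE OR REFRESH LIVE TABLE orders_cleaned
COMMENT "Hypothetical DLT table that filters a hypothetical upstream LIVE table"
AS SELECT * FROM LIVE.orders_raw WHERE order_id IS NOT NULL;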

Are concurrent reads on Delta Lake tables possible? - Precise Answer ✔✔Concurrent reads on Delta Lake tables are limited only by the hard limits of object storage on cloud vendors.

What is the syntax to update particular records of a table? - Precise Answer ✔✔UPDATE students SET value = value + 1 WHERE name LIKE "T%"

What is the syntax to delete particular records of a table? - Precise Answer ✔✔DELETE FROM students WHERE value > 6

What is the syntax for merge and what are the benefits of using merge? - Precise Answer ✔✔If you write 3 statements, one each to insert, update, and delete records, this would result in 3 separate transactions; if any of these transactions were to fail, it might leave our data in an invalid state. MERGE applies them as a single transaction (a concrete sketch follows this block):
MERGE INTO table_a a
USING table_b b
ON a.col_name = b.col_name
WHEN MATCHED AND b.col = X THEN UPDATE SET *
WHEN MATCHED AND a.col = Y THEN DELETE
WHEN NOT MATCHED AND b.col = Z THEN INSERT *

What is the syntax to delete a table? - Precise Answer ✔✔DROP TABLE students

What is Hive? - Precise Answer ✔✔Databricks uses a Hive metastore by default to register databases, tables, and views. Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale, allowing users to read, write, and manage petabytes of data using SQL.

What are the two commands to see metadata about a table? - Precise Answer ✔✔DESCRIBE EXTENDED students
DESCRIBE DETAIL students
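Building on the generic MERGE syntax above, a hedged concrete sketch against the students table used earlier; the updates source view and its type column are assumptions introduced for illustration:

-- Hypothetical source of changes to apply
CREATE OR REPLACE TEMP VIEW updates (id, name, value, type) AS VALUES
  (2, "Omar", 15.2, "update"),
  (3, "", NULL, "delete"),
  (7, "Blue", 7.7, "insert");

-- One atomic transaction covering the update, delete, and insert
MERGE INTO students s
USING updates u
ON s.id = u.id
WHEN MATCHED AND u.type = "update" THEN UPDATE SET name = u.name, value = u.value
WHEN MATCHED AND u.type = "delete" THEN DELETE
WHEN NOT MATCHED AND u.type = "insert" THEN INSERT (id, name, value) VALUES (u.id, u.name, u.value);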

What is the syntax to display the Delta Lake files? - Precise Answer ✔✔DESCRIBE DETAIL students
%python
display(dbutils.fs.ls(f"{DA.paths.user_db}/students"))

Describe the Delta Lake files, their format and directory structure - Precise Answer ✔✔Records in Delta Lake tables are stored as data in Parquet files. Transactions to Delta Lake tables are recorded in the _delta_log, which holds all the metadata about which Parquet files are currently valid. Each transaction results in a new JSON file being written to the Delta Lake transaction log.

What does the query engine do using the transaction logs when we query a Delta Lake table? - Precise Answer ✔✔Rather than overwriting or immediately deleting files containing changed data, Delta Lake uses the transaction log to indicate whether or not files are valid in a current version of the table. When we query a Delta Lake table, the query engine uses the transaction logs to resolve all the files that are valid in the current version, and ignores all other data files.

What commands do you use to compact small files and index tables? - Precise Answer ✔✔Using the OPTIMIZE command allows you to combine files toward an optimal size (scaled based on the size of the table). It will replace existing data files by combining records and rewriting the results. When executing OPTIMIZE, users can optionally specify one or several fields for ZORDER indexing. ZORDER speeds up data retrieval when filtering on the provided fields by colocating data with similar values within data files (a syntax sketch follows this block).
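Following the OPTIMIZE description above, a hedged syntax sketch against the students table used throughout these examples (the choice of id as the Z-ORDER column is an assumption for illustration):

-- Compact small files and co-locate records with similar id values
OPTIMIZE students
ZORDER BY (id);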

How do you review a history of table transactions? - Precise Answer ✔✔DESCRIBE HISTORY students

How do you query and roll back to a previous table version? - Precise Answer ✔✔SELECT * FROM students VERSION AS OF 3
RESTORE TABLE students TO VERSION AS OF 8

What command do you use to clean up stale data files and what are the consequences of using this command? - Precise Answer ✔✔VACUUM students
When running VACUUM and deleting files, we permanently remove access to versions of the table that require these files to materialize.

Using Delta Cache - Precise Answer ✔✔Using the Delta cache is an excellent way to optimize performance. Note: the Delta cache is not the same as caching in Apache Spark. One notable difference is that the Delta cache is stored entirely on the local disk, so that memory is not taken away from other operations within Spark. When enabled, the Delta cache automatically creates a copy of a remote file in local storage so that successive reads are significantly sped up.

What is the syntax to create a database with default location (no location specified)? - Precise Answer ✔✔CREATE DATABASE IF NOT EXISTS db_name_default_location;

What is the syntax to create a database with a specified location? - Precise Answer ✔✔CREATE DATABASE IF NOT EXISTS db_name_custom_location LOCATION 'path/db_name_custom_location.db';

How do you get metadata information of a database? Where are the databases located (difference between default vs custom location)? - Precise Answer ✔✔DESCRIBE DATABASE EXTENDED db_name;
The default location is under dbfs:/user/hive/warehouse/ and the database directory is the name of the database with the .db extension, so it is dbfs:/user/hive/warehouse/db_name.db. This is the directory that our database is tied to. The location of a database with a custom location is the directory specified after the LOCATION keyword.

What's the best practice when creating databases? - Precise Answer ✔✔Generally speaking, it is best practice to declare a location for a given database. This ensures that you know exactly where all of the data and tables are going to be stored.

What is the syntax for creating a table in a database with default location and inserting data? What is the syntax for a table in a database with custom location? - Precise Answer ✔✔USE db_name_default_location;

CREATE OR REPLACE TABLE managed_table_in_db_with_default_location (width INT, length INT, height INT);
INSERT INTO managed_table_in_db_with_default_location VALUES (3, 2, 1);
SELECT * FROM managed_table_in_db_with_default_location;
The same syntax is used when creating a table in the database with a custom location and inserting data.
Python syntax: df.write.saveAsTable("table_name")

Where are managed tables located in a database and how can you find their location? - Precise Answer ✔✔Databricks manages both the metadata and the data for a managed table. Managed tables are the default when creating a table. The data for a managed table resides in the LOCATION of the database it is registered to. This managed relationship between the data location and the database means that in order to move a managed table to a new database, you must rewrite all data to the new location. You can use this command to find a table's location within the database, both for default and custom locations:
DESCRIBE EXTENDED managed_table_in_db;

What is the syntax to create an external table? - Precise Answer ✔✔Databricks only manages the metadata for unmanaged (external) tables; when you drop the table, you do not affect the underlying data. Unmanaged tables will always specify a LOCATION during table creation; you can either register an existing directory of data files as a table or provide a path when a table is first defined.
USE db_name_default_location;
CREATE OR REPLACE TEMPORARY VIEW temp_delays USING CSV OPTIONS (
  path = '${da.paths.working_dir}/flights/departuredelays.csv',
  header = "true",
  mode = "FAILFAST" -- abort file parsing with a RuntimeException if any malformed lines are encountered
);
CREATE OR REPLACE TABLE external_table LOCATION 'path/external_table' AS SELECT * FROM temp_delays;
SELECT * FROM external_table;

What happens when you drop tables (difference between a managed and an unmanaged table)? - Precise Answer ✔✔For managed tables, when dropping the table, the table's directory and its log and data files will be deleted; only the database directory remains.

For unmanaged (external) tables, Databricks only manages the metadata; when you drop the table, you do not affect the underlying data. Because data and metadata are managed independently, you can rename a table or register it to a new database without needing to move any data.

What is the command to drop a database and its underlying tables and views? - Precise Answer ✔✔DROP DATABASE db_name_default_location CASCADE;

How can you show a list of tables and views? - Precise Answer ✔✔SHOW TABLES;

What is the difference between Views, Temp Views & Global Temp Views? - Precise Answer ✔✔
View: persisted as an object in a database, across multiple sessions, just like a table. You can query views from any part of the Databricks product (permissions allowing). Creating a view does not process or write any data; only the query text (i.e. the logic) is registered to the metastore in the associated database against the source.
Temporary View: limited scope and persistence, and not registered to a schema or catalog. In notebooks and jobs, temp views are scoped to the notebook or script level. They cannot be referenced outside of the notebook in which they are declared, and will no longer exist when the notebook detaches from the cluster.
Global Temporary View: scoped to the cluster level and can be shared between notebooks or jobs that share computing resources. They are registered to a separate database, the global temp database (rather than our declared database), so they won't show up in our list of tables associated with our declared database. This database lives as part of the cluster; as long as the cluster is on, the global temp view will be available from any Spark session that connects to that cluster or notebook attached to that cluster. Global temp views are lost when the cluster is restarted.

What is the syntax for creating views? - Precise Answer ✔✔CREATE VIEW view_delays_abq_lax AS SELECT * FROM external_table WHERE origin = 'ABQ' AND destination = 'LAX';
SELECT * FROM view_delays_abq_lax;

What is the syntax for creating temporary views? - Precise Answer ✔✔CREATE TEMPORARY VIEW temp_view_delays_gt_120 AS SELECT * FROM external_table WHERE delay > 120 ORDER BY delay ASC;
SELECT * FROM temp_view_delays_gt_120;

What is the syntax for creating global temporary views? - Precise Answer ✔✔CREATE GLOBAL TEMPORARY VIEW global_temp_view_dist_gt_1000 AS SELECT * FROM external_table WHERE distance > 1000;
SELECT * FROM global_temp.global_temp_view_dist_gt_1000;
Note the global_temp database qualifier in the subsequent SELECT statement.

Do views create underlying files? - Precise Answer ✔✔No. Creating a view does not process or write any data; only the query text (i.e. the logic) is registered to the metastore in the associated database against the source.

Where are global temp views created? - Precise Answer ✔✔They are registered to a separate database, the global temp database (rather than our declared database), so they won't show up in our list of tables associated with our declared database. This database lives as part of the cluster; as long as the cluster is on, the global temp view will be available from any Spark session that connects to that cluster or notebook attached to the cluster.

What is the syntax to select from global temp views? - Precise Answer ✔✔SELECT * FROM global_temp.name_of_the_global_temp_view;

What are CTEs? What is the syntax? - Precise Answer ✔✔A CTE only lasts for the duration of the query. It helps make the code more readable.
WITH cte_table AS (
  SELECT col1, col2, col
  FROM external_table
  WHERE col1 = X
  GROUP BY col
)
SELECT * FROM cte_table WHERE col1 > X AND col2 = Y;

What is the syntax to make multiple column aliases using a CTE? - Precise Answer ✔✔WITH flight_delays(
  total_delay_time,
  origin_airport,
  destination_airport
) AS (
  SELECT delay, origin, destination FROM external_table
)
SELECT * FROM flight_delays WHERE total_delay_time > 120 AND origin_airport = "ATL" AND destination_airport = "DEN";

What is the syntax for defining a CTE in a CTE? - Precise Answer ✔✔WITH lax_bos AS (
  WITH origin_destination (origin_airport, destination_airport) AS (
    SELECT origin, destination FROM external_table
  )
  SELECT * FROM origin_destination WHERE origin_airport = 'LAX' AND destination_airport = 'BOS'
)
SELECT count(origin_airport) AS `Total Flights from LAX to BOS` FROM lax_bos;

What is the syntax for defining a CTE in a subquery? - Precise Answer ✔✔SELECT max(total_delay) AS `Longest Delay (in minutes)` FROM (
  WITH delayed_flights(total_delay) AS (
    SELECT delay FROM external_table
  )
  SELECT * FROM delayed_flights
);

What is the syntax for defining a CTE in a subquery expression? - Precise Answer ✔✔SELECT (
  WITH distinct_origins AS (
    SELECT DISTINCT origin FROM external_table
  )
  SELECT count(origin) AS `Number of Distinct Origins` FROM distinct_origins
) AS `Number of Different Origin Airports`;

What is the syntax for defining a CTE in a CREATE VIEW statement? - Precise Answer ✔✔CREATE OR REPLACE VIEW BOS_LAX AS
WITH origin_destination(origin_airport, destination_airport) AS (
  SELECT origin, destination FROM external_table
)
SELECT * FROM origin_destination WHERE origin_airport = 'BOS' AND destination_airport = 'LAX';
SELECT count(origin_airport) AS `Number of Delayed Flights from BOS to LAX` FROM BOS_LAX;

How do you query data from a single file? - Precise Answer ✔✔SELECT * FROM file_format.`/path/to/file`
SELECT * FROM json.`${da.paths.datasets}/raw/events-kafka/001.json`

How do you create references to files? - Precise Answer ✔✔CREATE OR REPLACE TEMP VIEW events_temp_view AS SELECT * FROM json.`${da.paths.datasets}/raw/events-kafka/`;
SELECT * FROM events_temp_view

How do you extract text files as raw strings? - Precise Answer ✔✔When working with text-based files (which include JSON, CSV, TSV, and TXT formats), you can use the text format to load each line of the file as a row with one string column named value. This can be useful when data sources are prone to corruption and custom text parsing functions will be used to extract value from text fields.
SELECT * FROM text.`${da.paths.datasets}/raw/events-kafka/`

How do you extract the raw bytes and metadata of a file? What is a typical use case for this? - Precise Answer ✔✔Using binaryFile to query a directory will provide file metadata alongside the binary representation of the file contents. Specifically, the fields created will indicate the path, modificationTime, length, and content.
SELECT * FROM binaryFile.`${da.paths.datasets}/raw/events-kafka/`

Explain why executing a direct query against CSV files rarely returns the desired result. - Precise Answer ✔✔CSV files are one of the most common file formats, but, unlike JSON or Parquet, CSV is not a self-describing file format, so executing a direct query against these files rarely returns the desired results. When executing a direct query, the header row can be extracted as a table row, all columns can be loaded as a single column, and a column can contain nested data that ends up truncated.

Describe the syntax required to extract data from most formats against external sources. - Precise Answer ✔✔CREATE TABLE table_identifier (col_name1 col_type1, ...)
USING data_source
OPTIONS (key1 = "val1", key2 = "val2", ...)
LOCATION = path

The cell below demonstrates using Spark SQL DDL to create a table against an external CSV source.
CREATE TABLE sales_csv (order_id LONG, email STRING, transactions_timestamp LONG, total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE, unique_items INTEGER, items STRING)
USING CSV
OPTIONS (
  header = "true",
  delimiter = "|"
)
LOCATION "${da.paths.working_dir}/sales-csv"

What happens to the data, metadata and options during table declaration for these external sources? - Precise Answer ✔✔Note that no data has moved during table declaration. Similar to when we directly queried our files and created a view, we are still just pointing to files stored in an external location. All the metadata and options passed during table declaration will be persisted to the metastore, ensuring that data in the location will always be read with these options.

Does the column order matter if additional CSV data files are added to the source directory at a later stage? - Precise Answer ✔✔When working with CSVs as a data source, it's important to ensure that column order does not change if additional data files will be added to the source directory. Because the data format does not have strong schema enforcement, Spark will load columns and apply column names and data types in the order specified during table declaration.

What is the syntax to show all of the metadata associated with the table definition? - Precise Answer ✔✔DESCRIBE EXTENDED sales_csv

What are the limits of tables with external data sources? - Precise Answer ✔✔Whenever we're defining tables or queries against external data sources, we cannot expect the performance guarantees associated with Delta Lake and the Lakehouse. For example, while Delta Lake tables will guarantee that you always query the most recent version of your source data, tables registered against other data sources may represent older cached versions.

How can you manually refresh the cache of your data? - Precise Answer ✔✔REFRESH TABLE sales_csv

What is the syntax to extract data from SQL databases? - Precise Answer ✔✔CREATE TABLE table_identifier
USING JDBC
OPTIONS (
  url = "jdbc:{databaseServerType}://{jdbcHostname}:{jdbcPort}",
  dbtable = "{jdbcDatabase}.table",
  user = "{jdbcUsername}",
  password = "{jdbcPassword}"
)

Explain the two basic approaches that Spark uses to interact with external SQL databases and their limits - Precise Answer ✔✔You can move the entire source table(s) to Databricks and then execute logic on the currently active cluster. However, this can incur significant overhead because of the network transfer latency associated with moving all the data over the public internet.
You can push down the query to the external SQL database and only transfer the results back to Databricks. However, this can incur significant overhead because the execution of query logic in source systems is not optimized for big data queries.

What is a CTAS statement and what is the syntax? - Precise Answer ✔✔CREATE TABLE AS SELECT statements create and populate Delta tables using data retrieved from an input query.
CREATE OR REPLACE TABLE sales AS SELECT * FROM parquet.`${da.paths.datasets}/raw/sales-historical/`;
DESCRIBE EXTENDED sales;

Do CTAS statements support manual schema declaration? - Precise Answer ✔✔CTAS statements automatically infer schema information from query results and do not support manual schema declaration. This means that CTAS statements are useful for external data ingestion from sources with well-defined schema, such as Parquet files and tables. CTAS statements also do not support specifying additional file options. This presents significant limitations when trying to ingest data from CSV files.

What is the syntax to overcome the limitation when trying to ingest data from CSV files? - Precise Answer ✔✔CREATE OR REPLACE TEMP VIEW sales_tmp_vw (order_id LONG, email STRING, transactions_timestamp LONG, total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE, unique_items INTEGER, items STRING)
USING CSV
OPTIONS (
  path = "${da.paths.datasets}/raw/sales-csv",
  header = "true",
  delimiter = "|"
);
CREATE TABLE sales_delta AS SELECT * FROM sales_tmp_vw;
SELECT * FROM sales_delta

How do you filter and rename columns from existing tables during table creation? - Precise Answer ✔✔CREATE OR REPLACE TABLE purchases AS SELECT order_id AS id, transaction_timestamp, purchase_revenue_in_usd AS price FROM sales;
SELECT * FROM purchases

What is a generated column and how do you declare schemas with generated columns? - Precise Answer ✔✔A generated column is a column whose values are automatically computed from other columns in the table.
CREATE OR REPLACE TABLE purchase_dates (
  id STRING,
  transaction_timestamp STRING,
  price STRING,
  date DATE GENERATED ALWAYS AS (cast(cast(transaction_timestamp/1e6 AS TIMESTAMP) AS DATE)) COMMENT "generated based on transactions_timestamp column"
)

What are the two types of table constraints and how do you display them? - Precise Answer ✔✔Databricks currently supports two types of constraints: NOT NULL constraints and CHECK constraints.
ALTER TABLE purchase_dates ADD CONSTRAINT valid_date CHECK (date > '2020-01-01');
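Building on the CHECK constraint example above, a hedged sketch of adding a NOT NULL constraint and one way to inspect constraints (the choice of the id column is an assumption for illustration):

-- Add a NOT NULL constraint to an existing Delta table column
ALTER TABLE purchase_dates ALTER COLUMN id SET NOT NULL;

-- Constraints appear in the table's extended metadata / table properties
DESCRIBE EXTENDED purchase_dates;
SHOW TBLPROPERTIES purchase_dates;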

Which built-in Spark SQL commands are useful for file ingestion (for the SELECT clause)? - Precise Answer ✔✔current_timestamp() records the timestamp when the logic is executed; input_file_name() records the source data file for each record in the table.

What are the three options when creating tables? - Precise Answer ✔✔A COMMENT, a LOCATION, and a PARTITIONED BY clause, as in:
CREATE OR REPLACE TABLE users_pii
COMMENT "Contains PII"
LOCATION "${da.paths.working_dir}/tmp/users_pii"
PARTITIONED BY (first_touch_date)
AS SELECT *,
  cast(cast(user_first_touch_timestamp/1e6 AS TIMESTAMP) AS DATE) first_touch_date,
  current_timestamp() updated,
  input_file_name() source_file
FROM parquet.`${da.paths.datasets}/raw/users-historical/`;
SELECT * FROM users_pii;

As a best practice, should you default to partitioned tables for most use cases when working with Delta Lake? - Precise Answer ✔✔No. Most Delta Lake tables (especially small-to-medium sized data) will not benefit from partitioning. Because partitioning physically separates data files, this approach can result in a small files problem and prevent file compaction and efficient data skipping.

What are the two options to copy Delta Lake tables and what are the use cases? - Precise Answer ✔✔DEEP CLONE fully copies data and metadata from a source table to a target. This copy occurs incrementally, so executing this command again can sync changes from the source to the target location. Because all the data files must be copied over, this can take quite a while for large datasets.
CREATE OR REPLACE TABLE purchases_clone DEEP CLONE purchases
If you wish to create a copy of a table quickly to test out applying changes without the risk of modifying the current table, SHALLOW CLONE can be a good option. SHALLOW CLONE just copies the Delta

transaction logs, meaning that the data doesn't move.
CREATE OR REPLACE TABLE purchases_shallow_clone SHALLOW CLONE purchases

What are the multiple benefits of overwriting tables instead of deleting and recreating tables? - Precise Answer ✔✔Overwriting a table is much faster because it doesn't need to list the directory recursively or delete any files. The old version of the table still exists and can easily be retrieved using Time Travel. It's an atomic operation: concurrent queries can still read the table while you are overwriting it. Due to ACID transaction guarantees, if overwriting the table fails, the table will be in its previous state.

What are the two easy methods to accomplish complete overwrites? - Precise Answer ✔✔CREATE OR REPLACE TABLE events AS SELECT * FROM parquet.`${da.paths.datasets}/raw/events-historical`
INSERT OVERWRITE sales SELECT * FROM parquet.`${da.paths.datasets}/raw/sales-historical/`
INSERT OVERWRITE provides a nearly identical outcome as above: data in the target table will be replaced by data from the query. However, INSERT OVERWRITE can only overwrite an existing table, not create a new one like our CRAS statement, and it can overwrite only with new records that match the current table schema (and thus can be a "safer" technique for overwriting an existing table without disrupting downstream consumers).

What is the syntax to atomically append new rows to an existing Delta table? Is the command idempotent? - Precise Answer ✔✔INSERT INTO sales SELECT * FROM parquet.`${da.paths.datasets}/raw/sales-30m`
INSERT INTO does not have any built-in guarantees to prevent inserting the same records multiple times. Re-executing the above cell would write the same records to the target table, resulting in duplicate records.

What is the syntax for the MERGE SQL operation and the benefits of using merge? - Precise Answer ✔✔MERGE INTO target a
USING source b
ON {merge_condition}
WHEN MATCHED THEN {matched_action}
WHEN NOT MATCHED THEN {not_matched_action}
The main benefits of MERGE are:
1. updates, inserts, and deletes are completed as a single transaction;
2. multiple conditions can be added in addition to matching fields;
3. it provides extensive options for implementing custom logic.

What is the syntax to have an idempotent option to incrementally ingest data from external systems? - Precise Answer ✔✔COPY INTO provides SQL engineers an idempotent option to incrementally ingest data from external systems.
COPY INTO sales FROM "${da.paths.datasets}/raw/sales-30m" FILEFORMAT = PARQUET

How is COPY INTO different than Auto Loader? - Precise Answer ✔✔COPY INTO and Auto Loader are different technologies with very similar functionality: COPY INTO is focused on a SQL analyst doing a batch execution, whereas Auto Loader requires Structured Streaming (a hedged Auto Loader sketch follows this block).
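For contrast with COPY INTO above, Auto Loader is normally used from Python with Structured Streaming; in SQL it is available inside Delta Live Tables through the cloud_files() source. A minimal, hedged sketch (the table name and path are hypothetical):

-- Declared inside a Delta Live Tables pipeline
CREATE OR REFRESH STREAMING LIVE TABLE sales_raw
AS SELECT * FROM cloud_files("/path/to/raw/sales", "parquet");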
Do COUNT and DISTINCT queries skip or count nulls? - Precise Answer ✔✔
Counts NULLs:
1. COUNT(*) is a special case that counts the total number of rows, including rows that are only NULL values.
2. DISTINCT(*) treats the presence of NULL as a distinct record.
3. DISTINCT(col_name) treats the presence of NULL as a distinct record.
Skips NULLs:
1. COUNT(col_name)
2. COUNT(DISTINCT(*)) counts distinct values without NULL, because NULL is not something we can count.
3. COUNT(DISTINCT(col_name)) counts distinct values without NULL, because NULL is not something we can count.
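To make the NULL-counting behaviour above concrete, a small hedged sketch against a hypothetical temp view t with a nullable column col_name:

-- Hypothetical data: three rows, one NULL in col_name
CREATE OR REPLACE TEMP VIEW t (col_name) AS VALUES (1), (2), (NULL);

SELECT
  COUNT(*) AS total_rows,                        -- 3: counts rows containing NULLs
  COUNT(col_name) AS non_null_values,            -- 2: skips NULLs
  COUNT(DISTINCT col_name) AS distinct_non_null  -- 2: distinct values, NULL excluded
FROM t;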

What is the syntax to count null values? - Precise Answer ✔✔SELECT * FROM table_name WHERE col_name IS NULL
SELECT count_if(col_name IS NULL) AS new_col_name FROM table_name

What is the syntax to count the distinct values in a table for a specific column? - Precise Answer ✔✔SELECT COUNT(DISTINCT(col_1)) FROM table_name WHERE col_1 IS NOT NULL

What is the syntax to cast a column to a valid timestamp? - Precise Answer ✔✔SELECT date_format(col_name, "HH:mm:ss") AS new_col_name FROM table_name
SELECT CAST(col_name_with_transformation AS timestamp) AS new_col_name

What is the syntax for regex? - Precise Answer ✔✔SELECT regexp_extract(string_to_search, "regex_to_match", optional_match_portion_to_be_returned) AS email_domain FROM table_name

What is the syntax to deal with binary-encoded JSON values in a human-readable format? - Precise Answer ✔✔CREATE OR REPLACE TEMP VIEW events_strings AS SELECT string(key), string(value) FROM events_raw;
SELECT * FROM events_strings

What is the Spark SQL functionality to directly interact with JSON data stored as strings? - Precise Answer ✔✔SELECT value:device, value:geo:city FROM events_strings

What are struct types? What is the syntax to parse JSON objects into struct types with Spark SQL? - Precise Answer ✔✔Spark SQL has the ability to parse JSON objects into struct types (a native Spark type with nested attributes) by using the from_json function. However, from_json requires a schema. To derive the schema of the data, you can take a row example with no null fields and use Spark SQL's schema_of_json function.
CREATE OR REPLACE TEMP VIEW parsed_events AS SELECT from_json(value, schema_of_json('{insert_example_schema_here}')) AS json FROM events_strings;
SELECT * FROM parsed_events

Once a JSON string is unpacked to a struct type, what is the syntax to flatten the fields into columns? What is the syntax to interact with the subfields in a struct type? - Precise Answer ✔✔CREATE OR REPLACE TEMP VIEW new_events_final AS SELECT json.* FROM parsed_events;
SELECT * FROM new_events_final

What is the syntax to deal with nested struct types? - Precise Answer ✔✔SELECT ecommerce.purchase_revenue_in_usd FROM events WHERE ecommerce.purchase_revenue_in_usd IS NOT NULL

What is the syntax for exploding arrays of structs? - Precise Answer ✔✔The items field in the events table is an array of structs. Spark SQL has a number of functions specifically to deal with arrays; the explode function lets us put each element in an array on its own row.
SELECT user_id, event_timestamp, event_name, explode(items) AS item FROM events

What is the syntax to collect arrays? - Precise Answer ✔✔
1. The collect_set function can collect unique values for a field, including fields within arrays.
2. The flatten function allows multiple arrays to be combined into a single array.
3. The array_distinct function removes duplicate elements from an array.
We combine these functions to create a simple table that shows the unique collection of actions and the items in a user's cart.
SELECT user_id, collect_set(event_name) AS event_history, array_distinct(flatten(collect_set(items.item_id))) AS cart_history FROM events GROUP BY user_id

What is the syntax for an INNER JOIN? - Precise Answer ✔✔CREATE OR REPLACE VIEW sales_enriched AS
SELECT * FROM (
  SELECT *, explode(items) AS item FROM sales
) a
INNER JOIN item_lookup b ON a.item.item_id = b.item_id;
SELECT * FROM sales_enriched

What is the syntax for an outer join? - Precise Answer ✔✔SELECT * FROM employee FULL OUTER JOIN department ON employee.DepartmentID = department.DepartmentID;

What is the syntax for an anti-join? - Precise Answer ✔✔SELECT *
FROM customers a
LEFT JOIN customer_success_engineer b ON a.assigned_cse_id = b.cse_id
WHERE TRUE AND b.cse_id IS NULL

What is the syntax for a cross-join? - Precise Answer ✔✔SELECT ColumnName_1, ColumnName_2, ColumnName_N
FROM [Table_1]
CROSS JOIN [Table_2]

What is the syntax for a semi-join? - Precise Answer ✔✔SELECT columns FROM table_1 WHERE EXISTS (
  SELECT values FROM table_2 WHERE table_2.column = table_1.column
);

What is the syntax for the Spark SQL UNION, MINUS, and INTERSECT set operators? - Precise Answer ✔✔-- all rows
SELECT * FROM events
UNION
SELECT * FROM new_events_final

-- rows from events minus those in new_events_final
SELECT * FROM events
MINUS
SELECT * FROM new_events_final

-- only matching rows
SELECT * FROM events
INTERSECT
SELECT * FROM new_events_final

What is the syntax for pivot tables? - Precise Answer ✔✔SELECT * FROM (
  SELECT
    email,
    order_id,
    transaction_timestamp,
    total_item_quantity,
    purchase_revenue_in_usd,
    unique_items,
    item.item_id AS item_id,
    item.quantity AS quantity
  FROM sales_enriched
) PIVOT (
  sum(quantity) FOR item_id IN (
    'P_FOAM_K', 'M_STAN_Q', 'P_FOAM_S', 'M_PREM_Q', 'M_STAN_F'
  )
);