








































The Databricks Fundamentals Ultimate Exam introduces learners to the core concepts, tools, and workflows used within the Databricks platform. This exam covers basic data engineering principles, workspace navigation, notebooks, Apache Spark foundations, collaborative analytics, cloud integration, and introductory machine learning concepts. Designed for beginners and aspiring data professionals, the Ultimate Exam provides structured practice questions and practical examples that help users understand the fundamentals of data processing and analytics in a modern lakehouse environment.
Typology: Exams









































Question 1. Which component of the Databricks Lakehouse architecture is responsible for managing clusters, notebooks, and jobs?
A) Data Plane
B) Control Plane
C) Cloud Service Layer
D) Storage Layer
Answer: B
Explanation: The Control Plane hosts the web UI, cluster manager, and job scheduler, separating management functions from data processing, which occurs in the Data Plane.

Question 2. In the Lakehouse model, what primary advantage does decoupling storage from compute provide?
A) Automatic schema enforcement
B) Unlimited data versioning
C) Independent scaling of resources
D) Built-in AI recommendations
Answer: C
Explanation: Decoupling lets you scale compute up or down without moving data, optimizing cost and performance.

Question 3. Which caching mechanism stores copies of remote data files on the worker nodes’ local disks?
A) Spark Cache
B) Disk Cache
C) Memory Cache
D) GPU Cache
Answer: B
Explanation: The disk cache (formerly the Delta cache) keeps local copies of remote Parquet data files on the workers’ SSDs, whereas the Spark cache holds DataFrame or RDD contents that were explicitly cached with cache() or persist(), in executor memory or on disk depending on the storage level.

Question 4. What folder does Delta Lake use to store its transaction log?
A) _delta_log
B) _transaction_log
C) _metadata
D) _audit_log
Answer: A
Explanation: Delta Lake writes JSON commit files and checkpoint files under the _delta_log directory to guarantee ACID transactions.

Question 5. Which SQL clause enables time travel to view a previous version of a Delta table?
A) VERSION AS OF
B) FOR SYSTEM_TIME
C) VERSIONED AS OF
D) AT TIMESTAMP
Answer: A
Explanation: SELECT * FROM table VERSION AS OF 5 reads the table state at version 5.

Question 6. When a new column is added to a Delta table without dropping existing data, which feature is being used?
A) Schema Enforcement
B) Schema Evolution
C) Data Skipping
D) Vacuuming
Answer: B
Explanation: Schema Evolution allows the table schema to evolve (e.g., add columns) while preserving existing rows.

Question 7. What command permanently removes files that are no longer referenced by a Delta table?
A) CLEANUP
B) VACUUM
Explanation: Delta stores min/max values for each column per file, allowing the engine to skip irrelevant files.

Question 11. In the Medallion Architecture, which layer typically contains raw, unprocessed data?
A) Bronze
B) Silver
C) Gold
D) Platinum
Answer: A
Explanation: The Bronze layer ingests data in its original format, preserving fidelity for downstream processing.

Question 12. Which of the following best describes the Silver layer in Medallion Architecture?
A) Raw ingestion only
B) Cleaned, enriched, and de-duplicated data
C) Aggregated business metrics
D) Machine-learning model outputs
Answer: B
Explanation: Silver tables apply quality checks, joins, and transformations to produce curated datasets.

Question 13. The Gold layer is primarily used for which purpose?
A) Storing streaming checkpoints
B) Providing BI-ready aggregates and dimensions
C) Archiving old versions of data
D) Hosting model training data
Answer: B
Explanation: Gold tables expose business-ready, aggregated data for reporting and analytics.
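The Bronze, Silver, and Gold roles described in Questions 11 to 13 can be illustrated with a minimal, plain-Python sketch. It does not use Spark or Delta; the record layout, the `to_silver` and `to_gold` helpers, and the sample values are all hypothetical and exist only to show the flow from raw data to cleaned data to aggregates.

```python
from collections import defaultdict

# Hypothetical raw events as they might land in a Bronze table:
# untyped strings, one duplicate record, and one malformed amount.
bronze = [
    {"order_id": "1", "region": "EU", "amount": "10.50"},
    {"order_id": "1", "region": "EU", "amount": "10.50"},  # duplicate
    {"order_id": "2", "region": "US", "amount": "oops"},   # malformed
    {"order_id": "3", "region": "US", "amount": "4.25"},
    {"order_id": "4", "region": "EU", "amount": "2.00"},
]


def to_silver(rows):
    """Silver step: cast types, drop rows that fail the check, de-duplicate by order_id."""
    seen = set()
    silver_rows = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # quality check failed, so the row is not promoted
        if row["order_id"] in seen:
            continue  # keep only the first occurrence of each order
        seen.add(row["order_id"])
        silver_rows.append(
            {"order_id": int(row["order_id"]), "region": row["region"], "amount": amount}
        )
    return silver_rows


def to_gold(rows):
    """Gold step: aggregate the cleaned rows into a per-region total."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

In a real lakehouse each step would typically read from and write to a Delta table; the sketch only mirrors the division of responsibilities between the layers.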
Question 14. Which Databricks feature enables automatic, incremental ingestion from cloud storage without writing custom code?
A) COPY INTO
B) Auto Loader (cloudFiles)
C) DBFS Mount
D) Spark Structured Streaming
Answer: B
Explanation: Auto Loader detects new files in cloud storage and loads them efficiently using schema inference and checkpointing.

Question 15. What is the primary benefit of using the COPY INTO command?
A) Real-time streaming ingestion
B) Idempotent batch loads with automatic file discovery
C) Automatic schema evolution
D) Built-in data quality checks
Answer: B
Explanation: COPY INTO tracks the files it has already processed and loads each file only once, so batch loads can be re-run safely.

Question 16. In Delta Live Tables, which construct defines a data quality rule that can halt pipeline execution on violation?
A) EXPECTATIONS
B) CONSTRAINTS
C) VALIDATIONS
D) CHECKPOINTS
Answer: A
Explanation: Expectations let you declare quality constraints with an ON VIOLATION action such as FAIL UPDATE (stop the update) or DROP ROW (discard the offending records).

Question 17. Which DLT output mode creates a table that updates in place (upserts) based on primary key?
A) APPEND
D) Workspace Explorer
Answer: C
Explanation: The Dashboard Builder lets you pin query results and visualizations into interactive dashboards.

Question 21. What does the GenAI integration in Databricks SQL enable?
A) Automatic schema inference
B) Natural-language to SQL translation
C) Real-time model serving
D) Automated data cataloging
Answer: B
Explanation: GenAI lets users type questions in plain English, which are converted to SQL queries.

Question 22. In Unity Catalog, what is the correct hierarchical order from highest to lowest level?
A) Catalog > Metastore > Schema > Table
B) Metastore > Catalog > Schema > Table
C) Schema > Catalog > Metastore > Table
D) Table > Schema > Catalog > Metastore
Answer: B
Explanation: A Metastore contains multiple Catalogs; each Catalog contains Schemas, which contain Tables/Views.

Question 23. Which SQL command grants SELECT permission on a table to a group called analysts?
A) GRANT SELECT ON TABLE sales TO GROUP analysts;
B) GIVE SELECT sales TO analysts;
C) PERMIT SELECT sales GROUP analysts;
D) ALLOW SELECT ON sales GROUP analysts;
Answer: A
Explanation: Option A is the only choice that uses the GRANT <privilege> ON <object> TO <principal> form. In Unity Catalog the principal is normally given by name alone, for example GRANT SELECT ON TABLE sales TO `analysts`;

Question 24. Row-level security in Unity Catalog can be implemented using which feature?
A) Column Masking
B) Dynamic Views
C) Table Locks
D) Data Encryption
Answer: B
Explanation: Dynamic Views filter rows based on the session user or that user’s group membership, providing row-level access control.

Question 25. Column-level security in Unity Catalog is achieved through which mechanism?
A) Row Filters
B) Table Grants
C) Column Masking Policies
D) Data Skipping
Answer: C
Explanation: Masking policies define how column values are transformed or hidden for specific principals.

Question 26. Which Unity Catalog object type can store both data files and Delta tables?
A) External Table
B) Managed Table
C) File System
D) View
Answer: B
Explanation: Managed Tables are fully owned by Unity Catalog; their data files reside in the default storage location.
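The row-level and column-level controls from Questions 24 and 25 can be sketched in plain Python. This is only an illustration of the idea (filter rows by the caller's groups, mask a column unless the caller is entitled to see it), not Unity Catalog code; the `SALES` rows, the `GROUPS` mapping, the group names, and the `dynamic_view` and `mask_email` helpers are all hypothetical.

```python
# Hypothetical table rows and group memberships, for illustration only.
SALES = [
    {"region": "EU", "customer_email": "ana@example.com", "amount": 120},
    {"region": "US", "customer_email": "bob@example.com", "amount": 75},
]
GROUPS = {
    "alice": {"eu_analysts"},
    "root": {"admins", "pii_readers"},
}


def mask_email(value):
    """Column mask: hide the local part of the address, keep the domain."""
    _, _, domain = value.partition("@")
    return "***@" + domain


def dynamic_view(user):
    """Return the rows and column values that `user` is allowed to see."""
    groups = GROUPS.get(user, set())
    visible = []
    for row in SALES:
        # Row-level rule: admins see every row, eu_analysts see EU rows only.
        is_admin = "admins" in groups
        is_eu_reader = row["region"] == "EU" and "eu_analysts" in groups
        if not (is_admin or is_eu_reader):
            continue
        out = dict(row)
        # Column-level rule: only pii_readers see the unmasked email.
        if "pii_readers" not in groups:
            out["customer_email"] = mask_email(out["customer_email"])
        visible.append(out)
    return visible
```

Calling `dynamic_view("alice")` returns only the EU row with a masked email, while `dynamic_view("root")` returns both rows unmasked. In a SQL dynamic view the same decisions would be made inside the view definition based on the session user.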
A) Retry on Failure = true
B) Max Retries setting
C) Auto-Rerun flag
D) Failure Callback
Answer: B
Explanation: The max_retries field defines how many times the platform retries a failed task.

Question 31. Which notification channel can be configured directly from the Jobs UI for success/failure alerts?
A) SMS
B) Slack
C) PagerDuty
D) Webhook only
Answer: B
Explanation: Databricks Jobs UI supports Slack integration for real-time alerts.

Question 32. In MLflow Tracking, which entity records hyperparameters, metrics, and artifacts for a single run?
A) Experiment
B) Run
C) Model
D) Registry
Answer: B
Explanation: A Run is the atomic unit that logs parameters, metrics, and output files.

Question 33. Which stage in the MLflow Model Registry is intended for models that are ready for production deployment?
A) Staging
B) Production
C) Archived
D) Draft
Answer: B
Explanation: The Production stage marks models that have passed validation and are serving live traffic.

Question 34. What is the primary purpose of the Databricks Feature Store?
A) Store raw binary files
B) Manage reusable feature definitions and serve them to training and inference jobs
C) Host model binaries for deployment
D) Provide a UI for data visualization
Answer: B
Explanation: Feature Store centralizes feature engineering, ensuring consistency between training and serving.

Question 35. Mosaic AI in Databricks is primarily associated with which capability?
A) Distributed graph processing
B) Generative AI and LLM integration
C) Real-time streaming ingestion
D) Columnar storage compression
Answer: B
Explanation: Mosaic AI offers tools for building, fine-tuning, and serving large language models inside Databricks.

Question 36. Which of the following statements about Delta Lake’s ACID compliance is true?
A) It only guarantees atomicity for INSERT operations.
B) Isolation is achieved through optimistic concurrency control.
C) Consistency is enforced by external transaction managers.
D) Durability is not provided for streaming workloads.
Explanation: CONVERT TO DELTA registers the existing Parquet files as a Delta table in place.

Question 40. What is the effect of the OPTIMIZE … ZORDER BY (col1, col2) command?
A) Deletes duplicate rows based on col1 and col2.
B) Rewrites data files to cluster on the specified columns for faster range queries.
C) Creates a secondary index stored in a separate location.
D) Alters the table schema to add col1 and col2 as primary keys.
Answer: B
Explanation: Z-ORDER reorganizes file layout to colocate similar values of the listed columns, improving query pruning.

Question 41. Which of the following best describes a “managed” Delta table in Unity Catalog?
A) Its data files are stored outside the catalog’s storage root.
B) The catalog fully controls the table’s lifecycle and storage location.
C) It can only be accessed via SQL, not via Spark APIs.
D) It does not support time travel.
Answer: B
Explanation: Managed tables are owned by Unity Catalog; creation, deletion, and storage are handled automatically.

Question 42. Which Spark configuration controls the size of the in-memory buffer used when writing shuffle files?
A) spark.local.dir
B) spark.shuffle.file.buffer
C) spark.storage.safetyFraction
D) spark.shuffle.spill.compress
Answer: B
Explanation: spark.shuffle.file.buffer sets the size of the in-memory buffer for each shuffle file output stream, which holds shuffle output before it is written to disk.
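The pruning benefit mentioned in Question 40 can be shown with a small plain-Python sketch of file-level min/max statistics. It is a deliberate simplification: the data is clustered by sorting a single column, whereas Z-ordering clusters on several columns at once. The `write_files` and `files_to_scan` helpers and the sample values are hypothetical.

```python
def write_files(values, rows_per_file):
    """Split values into 'files' and record min/max statistics for each one."""
    files = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        files.append({"rows": chunk, "min": min(chunk), "max": max(chunk)})
    return files


def files_to_scan(files, low, high):
    """Keep only files whose [min, max] range can contain a value in [low, high]."""
    return [f for f in files if f["max"] >= low and f["min"] <= high]


# Hypothetical column values in arrival order.
values = [7, 1, 12, 3, 9, 2, 11, 5, 8, 4, 10, 6]

# Same data, laid out in arrival order versus clustered on the column.
unclustered = write_files(values, rows_per_file=4)
clustered = write_files(sorted(values), rows_per_file=4)

# Predicate: 1 <= x <= 4
scan_before = files_to_scan(unclustered, 1, 4)
scan_after = files_to_scan(clustered, 1, 4)
```

With the arrival-order layout every file's range overlaps the predicate, so nothing can be skipped; after clustering, only the first file has to be read.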
Question 43. In a multi-task job, how can you ensure that Task B starts only after Task A completes successfully?
A) Set Task B’s depends_on to Task A.
B) Use a while loop in Task B’s notebook.
C) Place both tasks in the same notebook cell.
D) Set max_retries of Task B to 0.
Answer: A
Explanation: The depends_on relationship defines task execution order and success dependency.

Question 44. Which of the following is NOT a valid reason to run VACUUM with a retention period of 0 hours?
A) To immediately free storage after a massive delete.
B) To comply with GDPR “right to be forgotten”.
C) To delete data that might still be needed for time travel.
D) To reduce the number of files for query performance.
Answer: C
Explanation: Setting retention to 0 removes all historical files, breaking time travel and potentially violating compliance requirements.

Question 45. What does the MERGE INTO statement do in Delta Lake?
A) Deletes duplicate rows.
B) Performs upserts based on a join condition.
C) Creates a new table from a SELECT query.
D) Compacts small files into larger ones.
Answer: B
Explanation: MERGE matches source rows to target rows and inserts, updates, or deletes accordingly.

Question 46. Which of the following is a benefit of using Serverless SQL Warehouses for ad-hoc analytics?
C) Streaming Trigger
D) Auto-Refresh
Answer: C
Explanation: Setting STREAMING in DLT creates a streaming pipeline that processes new data as it arrives.

Question 50. What is the default retention period for Delta Lake’s VACUUM command if not specified?
A) 7 days
B) 30 days
C) 1 day
D) 90 days
Answer: A
Explanation: By default, VACUUM retains unreferenced files for 7 days to protect against accidental data loss and to keep recent versions available for time travel.

Question 51. Which of the following best explains “data skipping” in the context of Delta Lake?
A) Skipping rows that fail schema validation.
B) Avoiding reading files whose min/max statistics do not satisfy the query predicate.
C) Ignoring partitions that are older than a retention period.
D) Bypassing the transaction log for read-only queries.
Answer: B
Explanation: Data skipping uses file-level statistics to prune irrelevant files during query execution.

Question 52. When you create a view in Unity Catalog that references tables across multiple catalogs, what permission must the user have?
A) SELECT on each underlying table only.
B) OWN on the view.
C) USAGE on all involved catalogs.
D) ADMIN on the metastore.
Answer: C
Explanation: Users need USAGE on each catalog (the USE CATALOG privilege in the current Unity Catalog privilege model) to resolve object references across catalog boundaries.

Question 53. Which Spark API is used to read a Delta table as a DataFrame?
A) spark.read.format("delta").load(path)
B) spark.delta.readTable(name)
C) spark.readDelta(path)
D) spark.sql("SELECT * FROM delta.path")
Answer: A
Explanation: The DataFrameReader with format “delta” loads the table from a path. A SQL query can also return a DataFrame, but a path-based table has to be written as delta.`<path>` with the path in backticks, which option D omits.

Question 54. Which of the following is a recommended practice before running OPTIMIZE on a large table?
A) Disable all concurrent writes.
B) Increase the default shuffle partitions to 2000.
C) Run VACUUM with a 0-hour retention.
D) Set spark.databricks.delta.optimize.maxFileSize to 1 GB.
Answer: A
Explanation: Optimizing while writes are occurring can lead to file conflicts; pausing writes ensures a consistent compaction.

Question 55. What does the STREAMING keyword do when defining a DLT pipeline?
A) Enables automatic checkpointing.
B) Forces the pipeline to run in batch mode.
C) Allows the pipeline to ingest data from a Delta Live Table source only.
D) Turns the pipeline into a continuous streaming job.
Answer: D
Question 59. What is the purpose of the spark.databricks.delta.retentionDurationCheck.enabled configuration?
A) To disable the 7-day minimum retention check for VACUUM.
B) To enable automatic schema evolution.
C) To enforce row-level security at query time.
D) To turn on Z-ORDER indexing automatically.
Answer: A
Explanation: Setting this to false allows VACUUM with a retention period shorter than the default 7-day safety window.

Question 60. Which of the following is NOT a supported source format for Auto Loader?
A) JSON
B) CSV
C) Parquet
D) Oracle Database
Answer: D
Explanation: Auto Loader works with files in cloud storage (JSON, CSV, Parquet, Avro, etc.), not directly with relational databases.

Question 61. In a Delta table, what does a “tombstone” file represent?
A) A checkpoint of the transaction log.
B) Metadata for a deleted file that is still retained for time travel.
C) A corrupted data file that needs repair.
D) An index file for Z-ORDER.
Answer: B
Explanation: A tombstone is the transaction-log record of a data file that has been removed from the current version. The file itself stays in storage until VACUUM deletes it, so older versions can still be reconstructed.

Question 62. Which command can be used to list all versions of a Delta table that are currently retained?
A) DESCRIBE HISTORY table;
B) SHOW VERSIONS OF table;
C) LIST LOGS FROM table;
D) SELECT * FROM _delta_log;
Answer: A
Explanation: DESCRIBE HISTORY returns a history of commits (versions) still available.

Question 63. Which of the following best describes the difference between a “managed” and an “external” table in Unity Catalog?
A) Managed tables store data in the catalog’s default storage; external tables reference data outside the catalog.
B) Managed tables cannot be versioned; external tables support time travel.
C) Managed tables are read-only; external tables allow writes.
D) Managed tables are only for streaming data; external tables are for batch data.
Answer: A
Explanation: Managed tables are fully owned by Unity Catalog; external tables point to data in external locations.

Question 64. Which of the following is a valid way to create a SQL alert that triggers when a query’s result exceeds a threshold?
A) CREATE ALERT IF SELECT COUNT(*) > 1000;
B) CREATE ALERT ON QUERY SELECT … WHEN result > 1000;
C) CREATE ALERT FROM QUERY ‘my_query’ WHEN result > 1000;
D) CREATE ALERT USING CONDITION (SELECT …) > 1000;
Answer: C
Explanation: CREATE ALERT FROM QUERY 'my_query' WHEN result > 1000; is the correct syntax in Databricks SQL.

Question 65. What does the mlflow.register_model API do?
A) Saves a model as an artifact in the current run.
B) Adds a model to the Model Registry under a given name.
C) Deploys a model to a serving endpoint automatically.