










































The Databricks Machine Learning ML Associate Ultimate Exam is designed for learners seeking to build foundational and intermediate machine learning skills using Databricks technologies. Topics include feature engineering, supervised and unsupervised learning, model training, ML workflows, experiment tracking, data preparation, and model deployment concepts. Candidates will strengthen their understanding of machine learning pipelines and collaborative AI development using Databricks tools and Spark ML frameworks. This Ultimate Exam offers realistic practice questions, hands-on scenarios, and detailed answer explanations to support certification success and practical machine learning applications.
Typology: Exams











































Question 1. Which Databricks Runtime is pre-installed with Conda, TensorFlow, PyTorch, and XGBoost?
A) Standard Runtime
B) Databricks Runtime for Machine Learning
C) Databricks Runtime for Genomics
D) Databricks Light Runtime
Answer: B
Explanation: The Databricks Runtime for Machine Learning (ML Runtime) ships with a curated Conda environment that includes popular ML libraries such as TensorFlow, PyTorch, and XGBoost, enabling rapid model development.

Question 2. When should you choose a Single-node cluster over a Standard cluster?
A) When training a distributed gradient-boosted tree model on terabytes of data
B) When running a quick prototype on a small CSV file
C) When you need to run Spark SQL across multiple worker nodes
D) When you must guarantee fault tolerance for streaming workloads
Answer: B
Explanation: Single-node clusters run Spark locally on the driver with no worker nodes, which makes them ideal for lightweight, exploratory tasks on small datasets. Distributed workloads require Standard (multi-node) clusters.

Question 3. Which command installs a Python library only for the current notebook session?
A) %pip install pandas
B) dbutils.library.installPyPI("pandas")
C) spark.conf.set("spark.jars.packages", "pandas")
D) %conda install pandas
Answer: A
Explanation: The %pip install magic installs the library into the notebook’s Python environment for the duration of that session, without affecting the whole cluster.

Question 4. In Databricks Repos, which Git operation is NOT directly supported from the UI?
A) Clone a repository
B) Create a new branch
C) Rebase onto another branch
D) Commit changes to the current branch
Answer: C
Explanation: Databricks Repos provides UI actions for cloning, branching, and committing, but advanced Git commands like rebasing must be performed via a local Git client or the terminal.

Question 5. What is the primary purpose of a Databricks Job dependency?
A) To enforce that two jobs run on the same cluster
B) To guarantee that a downstream task starts only after its upstream task succeeds
C) To share the same workspace directory between jobs
D) To automatically merge the results of two jobs into one table
Answer: B
Explanation: Job dependencies define execution order, ensuring that a task runs only after the successful completion of its predecessor, which is essential for multi-step ML pipelines.

Question 6. Which default metric does AutoML use for a binary classification problem when no metric is specified?
A) Accuracy
B) Area Under ROC (AUC)
C) F1-score
D) Log-loss
Answer: C
Explanation: Databricks AutoML uses the F1 score as the default primary metric for classification when none is specified. F1 balances precision and recall, which is more informative than raw accuracy on imbalanced data. Log-loss and ROC-AUC are available but must be selected explicitly.

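The ordering guarantee described in Question 5 can be illustrated without any Databricks API. The sketch below is a plain-Python analogy, assuming four hypothetical task names and a hypothetical `run_pipeline` helper: tasks run in dependency order, and anything downstream of a failed task is skipped.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps to the upstream tasks it depends on.
dependencies = {
    "prepare_features": set(),
    "train_model": {"prepare_features"},
    "evaluate_model": {"train_model"},
    "register_model": {"evaluate_model"},
}

def run_pipeline(dependencies, failing=frozenset()):
    """Visit tasks in dependency order; skip tasks whose upstream did not succeed."""
    succeeded, skipped = [], []
    for task in TopologicalSorter(dependencies).static_order():
        if task in failing:
            continue  # simulated failure: the task neither succeeds nor is skipped
        if dependencies[task] <= set(succeeded):
            succeeded.append(task)
        else:
            skipped.append(task)
    return succeeded, skipped

all_ok, none_skipped = run_pipeline(dependencies)
partial_ok, downstream_skipped = run_pipeline(dependencies, failing={"train_model"})
```

When `train_model` fails, only `prepare_features` succeeds and both later tasks are skipped, which mirrors the "run only after the predecessor succeeds" behavior the explanation describes.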
Explanation: VectorAssembler merges several numeric or already-encoded categorical columns into the single vector column required by most Spark ML estimators.

Question 10. When should you prefer One-Hot Encoding over String Indexing?
A) When the categorical column has high cardinality (>1000 distinct values)
B) When the downstream algorithm treats the feature as ordinal
C) When the model cannot handle sparse vectors
D) When the algorithm expects binary indicator columns for each category
Answer: D
Explanation: One-Hot Encoding creates a binary column per category. This suits algorithms, such as linear models, that would otherwise misread integer-encoded categories as ordered values.

Question 11. Which technique is most appropriate for detecting outliers in a skewed numeric column?
A) Z-score using mean and standard deviation
B) Interquartile Range (IQR) method
C) Removing values beyond the 99th percentile
D) Applying a log transformation and then using Z-score
Answer: B
Explanation: The IQR method flags values outside the interval [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR], where IQR = Q3 - Q1. It is robust to skewness because it relies on quartiles rather than on the mean and standard deviation.

Question 12. What is the effect of adding an imputation indicator column after filling missing values?
A) It reduces the dimensionality of the dataset
B) It allows the model to learn whether a value was originally missing
C) It improves the convergence speed of gradient-based optimizers
D) It automatically normalizes the imputed column
Answer: B
Explanation: An indicator column flags rows where imputation occurred, giving the model information about missingness patterns, which can be predictive.

Question 13. Which cross-validation strategy minimizes the risk of data leakage when time-series data is involved?
A) Random K-fold CV
B) Stratified K-fold CV
C) Time-Series Split (forward chaining)
D) Hold-out split with 70/30 ratio
Answer: C
Explanation: Time-Series Split respects temporal ordering, ensuring that training data always precedes validation data, thus avoiding leakage from future information.

Question 14. In Spark ML, which component represents a learning algorithm that can be fit to data?
A) Transformer
B) Estimator
C) PipelineStage
D) Model
Answer: B
Explanation: An Estimator implements a fit() method that produces a Model (a Transformer) after training on data.

Question 15. Which MLflow tracking API call logs a metric with a custom key and value?
A) mlflow.log_param("accuracy", 0.95)
B) mlflow.log_metric("accuracy", 0.95)
C) mlflow.log_artifact("accuracy.txt")
D) mlflow.start_run()
Answer: B

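As a worked illustration of the IQR rule from Question 11, the following minimal sketch uses only the Python standard library. The sample values and the helper names `iqr_outliers` and `zscore_outliers` are made up for this example, and the quartiles follow the default "exclusive" method of `statistics.quantiles`, which may differ slightly from other quartile conventions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] and the bounds used."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper], (lower, upper)

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Made-up sample: seven typical values and one extreme value.
sample = [10, 12, 11, 13, 12, 11, 10, 95]

iqr_flagged, bounds = iqr_outliers(sample)
z_flagged = zscore_outliers(sample)
```

On this sample the IQR rule flags 95, while the z-score rule with a threshold of 3 flags nothing, because the extreme value inflates the mean and standard deviation it is judged against.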
Question 19. Which method is used to register a model version programmatically in the MLflow Registry?
A) mlflow.register_model()
B) mlflow.create_registered_model()
C) mlflow.register_model(model_uri, "model_name")
D) mlflow.client().create_model_version(name, source, run_id)
Answer: C
Explanation: mlflow.register_model(model_uri, name) creates a new version of the named model, creating the registered model first if it does not exist. The lower-level MlflowClient().create_model_version(name, source, run_id) call also registers a version, but mlflow.client() as written in option D is not a valid call.

Question 20. In a batch inference pipeline, which approach yields the highest throughput when using a logged MLflow model?
A) Loading the model in each loop iteration and calling predict on a single row
B) Converting the model to a Spark UDF and applying it to the entire DataFrame
C) Exporting the model to ONNX and using a separate inference engine
D) Using mlflow.pyfunc.load_model inside a Pandas apply function
Answer: B
Explanation: Wrapping the model as a Spark UDF enables vectorized execution across partitions, dramatically increasing throughput compared to row-wise Python calls.

Question 21. Which feature of Delta Live Tables (DLT) helps ensure that feature data used for training is consistent with the data used for scoring?
A) Automatic schema evolution
B) Expectation-based quality checks
C) Table versioning and time travel
D) Streaming source connectors
Answer: C
Explanation: Delta Lake’s versioning allows you to read the exact snapshot of a feature table that was used during training, guaranteeing reproducibility during scoring.

Question 22. When would you use a Pandas UDF of type “GROUPED_MAP” instead of “SCALAR”?
A) When each input row must be transformed independently
B) When you need to apply a model that requires the entire group’s data (e.g., time-series forecasting per key)
C) When the function returns a scalar value for each row
D) When you want to avoid Apache Arrow serialization
Answer: B
Explanation: GROUPED_MAP UDFs receive a Pandas DataFrame for each group, allowing group-wise operations such as applying a model that needs the full context of the group.

Question 23. Which of the following is a correct way to enable automatic library installation for a Databricks Job cluster?
A) Add the library to the workspace’s global init script
B) Specify the library in the Job’s “Libraries” tab
C) Include pip install commands inside the notebook code
D) Use dbutils.library.restartPython() after installation
Answer: B
Explanation: The Job UI’s “Libraries” section lets you attach PyPI, Maven, or CRAN packages to the cluster that will be automatically installed before the job runs.

Question 24. In the context of model serving, what does “Serverless Real-Time Inference” provide that a traditional Spark UDF does not?
A) Ability to run inference on GPU clusters only
B) Automatic scaling to zero when there are no requests, with low-latency HTTP endpoints
C) Support for batch inference on Delta tables
D) Integration with Databricks Repos for version control
Answer: B

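The difference between row-wise and group-wise functions in Question 22 can be sketched in plain Python. This is only an analogy, not the PySpark Pandas UDF API: the rows, the `store` and `sales` field names, and the helpers `scalar_style` and `grouped_map_style` are all hypothetical.

```python
from collections import defaultdict

# Made-up rows keyed by a hypothetical "store" column.
rows = [
    {"store": "a", "sales": 10.0},
    {"store": "a", "sales": 14.0},
    {"store": "b", "sales": 100.0},
    {"store": "b", "sales": 120.0},
]

def scalar_style(rows, fn):
    """Row-wise: fn sees one value at a time and knows nothing about its group."""
    return [{**r, "sales": fn(r["sales"])} for r in rows]

def grouped_map_style(rows, key, fn):
    """Group-wise: fn receives every row sharing a key and returns new rows."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    out = []
    for group_rows in groups.values():
        out.extend(fn(group_rows))
    return out

def center_within_group(group_rows):
    """Needs the whole group: subtract the group's own mean from each row."""
    mean = sum(r["sales"] for r in group_rows) / len(group_rows)
    return [{**r, "sales": r["sales"] - mean} for r in group_rows]

doubled = scalar_style(rows, lambda x: x * 2)
centered = grouped_map_style(rows, "store", center_within_group)
```

Doubling works one value at a time, whereas centering cannot be computed without seeing all rows of a store together, which is the situation the grouped form is meant for.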
Question 28. When creating a Feature Table, which column type must be used for the primary key?
A) StringType only
B) Any primitive type (String, Integer, Long) as long as it is unique
C) TimestampType only
D) BinaryType
Answer: B
Explanation: The primary key can be any primitive type that uniquely identifies a row; Databricks enforces uniqueness but does not restrict the type.

Question 29. Which of the following statements about mlflow.pyfunc.load_model() is FALSE?
A) It returns a generic Python function that can predict on Pandas DataFrames
B) It can load models logged with any supported ML framework (TensorFlow, Scikit-Learn, etc.)
C) It automatically registers the model in the Model Registry upon loading
D) The returned object implements a predict method
Answer: C
Explanation: Loading a model does not register it; registration requires explicit calls to the Model Registry API.

Question 30. In a Databricks notebook, which magic command displays the contents of a DBFS file?
A) %fs cat /path/to/file
B) %sh cat /dbfs/path/to/file
C) dbutils.fs.head("/path/to/file")
D) %sql SELECT * FROM file
Answer: B
Explanation: %sh cat reads the file through the /dbfs FUSE mount. The %fs magic exposes dbutils.fs commands such as ls and head but has no cat subcommand, and dbutils.fs.head (option C) is a Python call rather than a magic command.

Question 31. Which hyperparameter tuning method is most sample-efficient for high-dimensional spaces?
A) Exhaustive Grid Search
B) Random Search
C) Bayesian Optimization (e.g., Tree-structured Parzen Estimator)
D) Manual tuning
Answer: C
Explanation: Bayesian Optimization models the performance surface and uses it to select promising hyperparameter configurations, achieving good results with fewer trials than grid or random search in high-dimensional spaces.

Question 32. What is the primary advantage of using mlflow.sklearn.log_model() over mlflow.pyfunc.log_model() for a Scikit-Learn model?
A) It automatically logs feature importance charts
B) It stores the model in a format native to Scikit-Learn, enabling direct loading with mlflow.sklearn.load_model
C) It creates a REST endpoint automatically
D) It compresses the model to reduce storage size
Answer: B
Explanation: mlflow.sklearn.log_model preserves the Scikit-Learn model’s native pickle format, so mlflow.sklearn.load_model returns the original Scikit-Learn object with its full API.

Question 33. Which of the following is a correct way to read a Feature Store table into a Spark DataFrame?
A) spark.read.format("featurestore").load("my_feature_table")
B) feature_store.read_table("my_feature_table")
C) fs = FeatureStoreClient(); df = fs.read_table("my_feature_table")
D) dbutils.fs.read("feature://my_feature_table")
Answer: C
Explanation: The FeatureStoreClient provides the read_table method to retrieve a feature table as a Spark DataFrame.

Question 37. Which of the following is a valid reason to use the “archived” stage in the Model Registry?
A) The model is the current production version
B) The model has been deprecated and should not be used for inference
C) The model is being tested in a staging environment
D) The model is a draft awaiting validation
Answer: B
Explanation: “Archived” marks a model as retired; it remains stored for audit purposes but is excluded from serving pipelines.

Question 38. In the context of Delta Lake, what does the OPTIMIZE command accomplish?
A) Compacts small files into larger ones to improve read performance
B) Deletes rows that match a given condition
C) Updates the schema of a Delta table
D) Enables automatic versioning of the table
Answer: A
Explanation: OPTIMIZE rewrites many small files into larger ones (data compaction), reducing the number of file reads and improving query performance.

Question 39. Which of the following is NOT a built-in AutoML evaluation metric for forecasting tasks?
A) Mean Absolute Percentage Error (MAPE)
B) Symmetric Mean Absolute Percentage Error (sMAPE)
C) Root Mean Squared Logarithmic Error (RMSLE)
D) R²
Answer: D
Explanation: R² is a regression metric and is not offered by Databricks AutoML for time-series forecasting. Forecasting runs are ranked by error metrics instead, with sMAPE as the default.

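The three error metrics named in the options of Question 39 have simple closed forms. The sketch below writes them out in plain Python under stated assumptions: the function names and sample values are made up, sMAPE uses the common 2|p - a| / (|a| + |p|) definition (other variants exist), and none of the inputs may be zero for MAPE or negative for RMSLE.

```python
import math

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent. Assumes no actual value is 0."""
    n = len(actual)
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n

def smape(actual, predicted):
    """Symmetric MAPE, in percent, using 2|p - a| / (|a| + |p|) per point."""
    n = len(actual)
    return 100 * sum(
        2 * abs(p - a) / (abs(a) + abs(p)) for a, p in zip(actual, predicted)
    ) / n

def rmsle(actual, predicted):
    """Root Mean Squared Logarithmic Error. Assumes non-negative values."""
    n = len(actual)
    return math.sqrt(
        sum((math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)) / n
    )

# Made-up series: each forecast is off by 10% of the actual value.
actual = [100.0, 200.0]
predicted = [110.0, 180.0]

mape_value = mape(actual, predicted)    # about 10 (percent)
smape_value = smape(actual, predicted)
rmsle_value = rmsle(actual, predicted)
```

All three are scale-aware error measures where lower is better, which is why they suit forecasting better than a goodness-of-fit score such as R².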
Question 40. When using mlflow.start_run() with the nested=True argument, what is the effect?
A) The new run overwrites any existing active run
B) The new run becomes a child of the currently active run, allowing hierarchical logging
C) The run is automatically transitioned to Production
D) The run logs only parameters, not metrics
Answer: B
Explanation: nested=True creates a child run under the active parent run, enabling structured experiment tracking.

Question 41. Which Spark configuration should be increased to improve the performance of a wide join operation?
A) spark.sql.autoBroadcastJoinThreshold
B) spark.sql.shuffle.partitions
C) spark.sql.broadcastTimeout
D) spark.sql.joinPreCacheSize
Answer: A
Explanation: Raising spark.sql.autoBroadcastJoinThreshold allows larger tables to be broadcast, reducing shuffle overhead for wide joins.

Question 42. Which of the following is a correct way to convert a Pandas DataFrame to a Spark DataFrame using the pandas API on Spark?
A) spark.createDataFrame(pandas_df)
B) pandas_df.to_spark()
C) ps.from_pandas(pandas_df)
D) spark.pandas.from_pandas(pandas_df)
Answer: C
Explanation: pyspark.pandas.from_pandas (often imported as ps) converts a Pandas DataFrame into a pandas-on-Spark DataFrame that can be used in Spark jobs.

Question 46. Which metric is most suitable for evaluating a highly imbalanced binary classification model where false negatives are critical?
A) Accuracy
B) F1-score
C) Recall
D) ROC-AUC
Answer: C
Explanation: Recall (TP / (TP + FN)) directly measures the ability to capture positive cases, which is essential when missing a positive (false negative) is costly.

Question 47. What does the mlflow.log_artifact() function do?
A) Saves a model parameter to the tracking server
B) Uploads a local file (e.g., plot, config) to the run’s artifact store
C) Registers a model version automatically
D) Deletes an existing artifact from the run
Answer: B
Explanation: log_artifact copies a file from the local filesystem into the run’s artifact storage, making it accessible through the UI.

Question 48. Which of the following is a correct way to apply a trained Spark ML model to a streaming DataFrame?
A) model.transform(streaming_df).writeStream.start()
B) model.predict(streaming_df)
C) model.apply(streaming_df).foreachBatch()
D) model.fit(streaming_df)
Answer: A
Explanation: Spark ML models are Transformers; calling transform on a streaming DataFrame returns a new streaming DataFrame that can be written out with writeStream.

Question 49. Which of the following is NOT a valid way to pass parameters to a Databricks Job notebook task?
A) Using the “Base parameters” JSON field in the job UI
B) Adding %run ./other_notebook with arguments at the top of the notebook
C) Accessing dbutils.widgets.get("param_name") inside the notebook
D) Setting environment variables via spark.conf.set before the task runs
Answer: B
Explanation: %run with arguments is a notebook-to-notebook inclusion technique, not a way to pass parameters from a Job task; the Job UI’s parameter fields and widgets are the correct mechanisms.

Question 50. When would you prefer using mlflow.spark.log_model() over mlflow.pyfunc.log_model() for a Spark ML pipeline?
A) When you need to serve the model via a REST API
B) When you want to preserve Spark-specific metadata and enable direct Spark loading
C) When the model is written in TensorFlow
D) When you plan to export the model to ONNX
Answer: B
Explanation: mlflow.spark.log_model stores Spark-specific metadata, allowing the model to be loaded back as a Spark PipelineModel without conversion to a generic pyfunc wrapper.

Question 51. Which of the following is the most appropriate way to handle a categorical feature with 10,000 distinct values?
A) One-Hot Encode the column
B) Use StringIndexer only
C) Apply Target Encoding or Hashing Trick
D) Drop the column entirely
Answer: C
Explanation: One-Hot Encoding would create 10,000 columns, which is impractical. Target encoding or the hashing trick reduces dimensionality while preserving predictive power.

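The hashing trick mentioned in Question 51 can be sketched with the standard library alone. This is a minimal illustration, not Spark ML's FeatureHasher: the helper names, the choice of MD5 as a stable hash, the bucket counts, and the `user_<n>` category values are all assumptions made for the example.

```python
import hashlib

def hash_bucket(category, num_buckets):
    """Map a category string to a stable bucket index in [0, num_buckets)."""
    digest = hashlib.md5(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

def hashed_one_hot(category, num_buckets=16):
    """Encode a category as a fixed-width indicator vector via its bucket."""
    vector = [0] * num_buckets
    vector[hash_bucket(category, num_buckets)] = 1
    return vector

# 10,000 made-up category values squeezed into 256 columns.
categories = [f"user_{i}" for i in range(10_000)]
used_buckets = {hash_bucket(c, 256) for c in categories}

example = hashed_one_hot("user_42")
```

The feature width is fixed by `num_buckets` regardless of how many distinct values appear, at the cost of collisions: different categories can share a bucket, so the bucket count trades memory against information loss.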
Question 55. Which of the following statements about mlflow.sklearn.log_model() is FALSE?
A) It stores the model in a format that can be loaded with mlflow.sklearn.load_model
B) It automatically logs the model’s conda environment if conda_env is provided
C) It creates a generic pyfunc model wrapper under the hood
D) It registers the model in the Model Registry automatically
Answer: D
Explanation: Logging a model does not register it by default; registration requires passing registered_model_name or making an explicit call to the Model Registry API.

Question 56. In a Databricks notebook, which command creates a new widget for user input?
A) dbutils.widgets.text("name", "default")
B) dbutils.widgets.create("name", "default")
C) %widget name default
D) spark.sql("CREATE WIDGET")
Answer: A
Explanation: dbutils.widgets.text (or dropdown, combobox) creates a UI widget whose value can be read with dbutils.widgets.get.

Question 57. Which of the following is a valid reason to use a “single-node” cluster for training a PyTorch model?
A) The model requires distributed data parallelism across many GPUs
B) The dataset fits entirely in the driver’s memory and the model does not need Spark’s distributed capabilities
C) You need to run a Spark SQL query before training
D) You plan to serve the model using Spark Structured Streaming
Answer: B
Explanation: Single-node clusters are suitable for deep-learning workloads that run on a single GPU/CPU and do not benefit from Spark’s distributed processing.

Question 58. What does the mlflow.search_runs() function return?
A) A list of model versions in the registry
B) A DataFrame containing runs that match the provided filter criteria
C) The best run based on a specified metric
D) A dictionary of all logged artifacts for a run
Answer: B
Explanation: mlflow.search_runs queries the tracking server and, by default, returns a Pandas DataFrame with the runs that satisfy the filter string.

Question 59. Which of the following is the correct order of steps when building a Spark ML Pipeline for a classification task?
A) VectorAssembler → StringIndexer → StandardScaler → LogisticRegression
B) StringIndexer → VectorAssembler → StandardScaler → LogisticRegression
C) StandardScaler → VectorAssembler → StringIndexer → LogisticRegression
D) LogisticRegression → VectorAssembler → StringIndexer → StandardScaler
Answer: B
Explanation: Categorical columns must be indexed first (StringIndexer), then assembled into a feature vector (VectorAssembler), optionally scaled (StandardScaler), and finally fed to the estimator (LogisticRegression).

Question 60. Which of the following is NOT a built-in Spark ML transformer?
A) Bucketizer
B) Imputer
C) FeatureHasher
D) HyperoptEstimator
Answer: D
Explanation: HyperoptEstimator comes from the hyperopt-sklearn package in the Hyperopt ecosystem, not from Spark ML’s transformer/estimator hierarchy.

Question 61. When using Delta Live Tables, what does the EXPECT clause do?